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Summary of Basic Rules for Probability Theory 

“Probability theory is nothing but common sense reduced to calculation” 
(Laplace). 

Introduction 


This module was adapted from E.T. Jaynes’ manuscript entitled: 
“Probability Theory with Applications to Science and Engineering — A 
Series of Informal Lectures”, 1974. The entire manuscript is available at 


A second and significantly expanded edition of this manuscript is available 
on Amazon. The first 3 chapters of the second edition are available here 


Deductive Logic (Boolean Algebra) 


Denote propositions by A, B, etc., their denials by A,, B, etc. Define the 
logical product and logical sum by 


~ “Both A and B are true” 


A-+B=. a » 
At least one of the propositions, A, B are true 


Deductive reasoning then consists of applying relations such as 
A+A=A; 
A(B+C) = (AB) + (AC); 


if D = A,B,then D, = A+ B. 


Inductive Logic (Probability Theory) 


Inductive logic is the extension of deductive logic, describing the reasoning 
of an idealized “robot”, who represents degrees of plausibility of a logical 
proposition by real numbers: 


p(A | B)= probability of A, given B. 


We use the original term “robot” advocated by Jaynes, it is intended to 
mean the use of inductive logic that follows a set of consistent rules that can 
be agreed upon. In this formulation of probability theory, conditional 
probabilities are fundamental. The elementary requirements of common 
sense and consistency determine these basic rules of reasoning (see Jaynes 
for the derivation). 


In these rules, one can think of the proposition C’ being the prior 
information that is available to assign probabilities to logical propositions, 
but these rules are true without this interpretation. 


Rule 1: p(AB | C) = p(A | BC)p(B | C) = p(B | AC)p(A | C) 
Rule 2: p(A | B) + p(A,. | B) =1 
Rule 3: p(A+ B|C) = p(A|C)+ p(B | C) — p(AB | C) 


Rule 4: If {A1,...A}are mutually exclusive and exhaustive, and 
information Bis indifferent to tem; i.e. if B gives no preference to one over 
any other then: 


p(A; | B) =1/n,i = 1...n (principle of insufficient reason) 


From rule 1 we obtain Bayes’ theorem: 


p(BIAC) 


p(A | BC) = p(A | C) p(BIC) 


From Rule 3, if {Ai,... Ay }are mutually exclusive, 
p(Ai +... An | B) = 3-1 p(Ai | B) 


If in addition, the A,;are exhaustive, we obtain the chain rule: 


p(B | C) = 1 p(BA; | C) = Yi-1 p(B | AiC)p(A; | C) 


Prior Probabilities 


The initial information available to the robot at the beginning of any 
problem is denoted by X. p(A | X)is then the prior probability of A. 
Applying Bayes’ theorem to take account of new evidence Fyields the 
posterior probability p(A | EX). In a posterior probability we sometimes 
leave off the X for brevity: p(A | EF) = p(A | EX). 


Prior probabilities are determined by Rule 4 when applicable; or more 
generally by the principle of maximum entropy. 


Decision Theory 

Enumerate the possible decisions D,,...D,and introduce the loss function 
L(D;,6;)representing the “loss” incurred by making decision D,if 6;is the 
true state of nature. After accumulating new evidence E, make that decision 
D;which minimizes the expected loss over the posterior distribution of 0; : 


Choose the decision D;which minimizes (L); = >), L(Di,9;)p(6; | EX) 


chooseD;such that is minimized 


Decibel scale with signal processing applications 


Introduction 


The concept of decibel originates from telephone engineers who were 
working with power loss in a telephone line consisting of cascaded circuits. 
The power loss in each circuit is the ratio of the power in to the power out, 
or equivivalently, the power gain is the ratio of the power out to the power 
in. 


Let P,, be the power input to a telephone line and P,,; the power out. The 
power gain is then given by 
Equation: 


Taking the logarithm of the gain formula we obtain a comparative measure 
called Bel. 


Note: Gain (Bel) = log Ft 


This measure is in honour of Alexander G. Bell, see [link]. 


Alexander G. Bell 


Decibel 


Bel is often a to large quantity, so we define a more useful measure, decibel: 
Equation: 


Pou 
Gain (dB) = 10 log a é 


in 


Please note from the definition that the gain in dB is relative to the input 
power. In general we define: 
Equation: 


Number of decibels = 10 log 


ref 


If no reference level is given it is customary to use P,ep = 1 W, in which 
case we have: 


Note: Number of decibels = 10 log P 


Example: 

Given the power spectrum density (psd) function of a signal x(n), S;x(zf). 
Express the magnitude of the psd in decibels. 

We find S,x(dB) = 10 log |$,x(if)|. 


More about decibels 


Above we’ve calculated the decibel equivalent of power. Power is a 
quadratic variable, whereas voltage and current are linear variables. This 


can be seen, for example, from the formulas P = v and P = I?R. 


So if we want to find the decibel value of a current or voltage, or more 
general an amplitude we use: 
Equation: 
Amplitude 
Amplitude (dB) = 20 log Rises sata 
Amplitude, .¢ 


This is illustrated in the following example. 


Example: 

Express the magnitude of the filter H(if) in dB scale. 
The magnitude is given by |H(zf)|, which gives: 
|H{(dB)| = 20 log |HI(éf)|. 


Plots of the magnitude of an example filter |H(if)| and its decibel 
equivalent are shown in [link]. 
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Magnitude responses. 


Some basic arithmetic 


The ratios 1,10,100, 1000 give dB values 0 dB, 10 dB, 20 dB and 30 dB 
respectively. This implies that an increase of 10 dB corresponds to a ratio 
increase by a factor 10. 


This can easily be shown: Given a ratio R we have R[dB] = 10 log R. 
Increasing the ratio by a factor of 10 we have: 10 log (10*R) = 10 log 10 + 
10 log R=10dB+RdaB. 


Another important dB-value is 3dB. This comes from the fact that: 


An increase by a factor 2 gives: an increase of 10 log 2* 3 dB. A 
“increase” by a factor 1/2 gives: an “increase” of 10 log 1/2 * -3 dB. 


Example: 

In filter terminology the cut-off frequency is a term that often appears. 
The cutoff frequency (for lowpass and highpass filters), f;, is the frequency 
at which the squared magnitude response in dB is %. In decibel scale this 
corresponds to about -3 dB. 


Decibels in linear systems 


In signal processing we have the following relations for linear systems: 
Equation: 


Y(if) = H(if)X(if) 


where X and H denotes the input signal and the filter respectively. Taking 
absolute values on both sides of [link] and converting to decibels we get: 


Note: The output amplitude at a given frequency is simply given by the 
sum of the filter gain and the input amplitude, both in dB. 


Other references: 


Above we have used P,e¢ = 1 W as a reference and obtained the standard 
dB measure. In some applications it is more useful to use P,.¢ = 1 mW and 
we then have the dBm measure. 


Another example is when calculating the gain of different antennas. Then it 
is customary to use an isotropic (equal radiation in all directions) antenna as 
a reference. So for a given antenna we can use the dBi measure. (i -> 
isotropic) 


Matlab files 


filter_example.m 


Foundations of Probability Theory: Basic Definitions 


Basic Definitions 


The basis of probability theory is a set of events - sample space - and a 
systematic set of numbers - probabilities - assigned to each event. The key 
aspect of the theory is the system of assigning probabilities. Formally, a 
sample space is the set {2 of all possible outcomes w; of an experiment. An 
event is a collection of sample points w; determined by some set-algebraic 
rules governed by the laws of Boolean algebra. Letting A and B denote 
events, these laws are 


Union:AU B= {w|weE A V we B} 
Intersection:AN B= {wlwe A A we B} 
Complement: A’ = {w|w ¢ A} 
(AUB)'=A'NB 


The null set 0 is the complement of 92. Events are said to be mutually 
exclusive if there is no element common to both events: AM B = @. 


Associated with each event A; is a probability measure Pr|A;], sometimes 
denoted by 7, that obeys the axioms of probability. 


e Pr[Q] = 1 
e If AN B = Q, then Pr|A U B] = Pr[A] + Pr[B). 


The consistent set of probabilities Pr|-] assigned to events are known as the 
a priori probabilities. From the axioms, probability assignments for 
Boolean expressions can be computed. For example, simple Boolean 
manipulations (A U B = AU A’B) lead to 

Equation: 


Pr[A U B] = Pr[A] + Pr[B] — Pr[An BI 


Suppose Pr/B] 4 0. Suppose we know that the event B has occurred; what 
is the probability that event A has also occurred? This calculation is known 
as the conditional probability of A given B and is denoted by Pr[A | BJ. 
To evaluate conditional probabilities, consider B to be the sample space 
rather than §2. To obtain a probability assignment under these circumstances 
consistent with the axioms of probability, we must have 

Equation: 


Pr[An B 


Pr|A | B|= 


The event is said to be statistically independent of B if 

Pr[A | B] = Pr|A}: the occurrence of the event B does not change the 
probability that A occurred. When independent, the probability of their 
intersection Pr[A  B] is given by the product of the a priori probabilities 
Pr[A] Pr|B]. This property is necessary and sufficient for the independence 


of the two events. As Pr[A | B] = eA aad Pr[B | Al Pr[ANB| 


Pr[B] — “PyfAy ? 
obtain Bayes' Rule. 
Equation: 


Pr[A | B] Pr[B| 


Pr[B | A] = Sa 


Review of Probability Theory 


The focus of this course is on digital communication, which involves transmission of 
information, in its most general sense, from source to destination using digital technology. 
Engineering such a system requires modeling both the information and the transmission 
media. Interestingly, modeling both digital or analog information and many physical 
media requires a probabilistic setting. In this chapter and in the next one we will review 
the theory of probability, model random signals, and characterize their behavior as they 
traverse through deterministic systems disturbed by noise and interference. In order to 
develop practical models for random phenomena we start with carrying out a random 
experiment. We then introduce definitions, rules, and axioms for modeling within the 
context of the experiment. The outcome of a random experiment is denoted by w. The 
sample space £2 is the set of all possible outcomes of a random experiment. Such 
outcomes could be an abstract description in words. A scientific experiment should indeed 
be repeatable where each outcome could naturally have an associated probability of 
occurrence. This is defined formally as the ratio of the number of times the outcome 
occurs to the total number of times the experiment is repeated. 


Random Variables 


A random variable is the assignment of a real number to each outcome of a random 
experiment. 


X(@) 


Example: 

Roll a dice. Outcomes {w , w, w3, W4, W5, We } 
Ww; = 12 dots on the face of the dice. 

X(w;) =) 


Distributions 


Probability assignments on intervals a < X < b 


Cumulative distribution 
The cumulative distribution function of a random variable X is a function 
Fy (R+ R) such that 


Equation: 
Fy (6) = Pr[x <5] 
= Pri{w € Q| X(w) < b}| 
X(0) 
———— > 
Q : ss 


Continuous Random Variable 
A random variable X is continuous if the cumulative distribution function can be 
written in an integral form, or 
Equation: 


and fx (x) is the probability density function (pdf) (e.g., Fx (x) is differentiable and 
fy (x) = 4, (Fx (2))) 


Discrete Random Variable 
A random variable X is discrete if it only takes at most countably many points (i.e., 
F x (-) is piecewise constant). The probability mass function (pmf) is defined as 
Equation: 


Pr|X = a; 
Fy (xx) — limit Fy (az) 


xu(a—>az) A (a@<apz) 


Px (xx) 


Two random variables defined on an experiment have joint distribution 


Equation: 


Fy, y (a,b) = Pr[xX <a,Y <) 
Pri{w € 2| (X(w) <a) A (Y(w) < bp} 


Y 
| (a,b) 


_——————— 


Joint pdf can be obtained if they are jointly continuous 


Equation: 
b a 
Fy y (a, b) = / ‘| fxy (x, y) dad y 
@Fy y(z, 
(e.g., fxy (2, y) = SEH) ) 


Joint pmf if they are jointly discrete 
Equation: 


PxY (2a, yi) = Pr[X =2,,Y = yi] 
Conditional density function 
Equation: 


fxy (x, y) 


fx (a) 


fyjx(ylz) = 


for all x with fx (x) > 0 otherwise conditional density is not defined for those values of x 
with fx (z) = 0 


Two random variables are independent if 
Equation: 


fx,y (x,y) =fx (x) fy (y) 
for all z € Randy € R. For discrete random variables, 
Equation: 


Pxy (te, yi) =Px (xe) Py (y) 


for all k and l. 


Moments 


Statistical quantities to represent some of the characteristics of a random variable. 
Equation: 


g(X) 


El9(X)] 
g(x) fx (x) da if continuous 
5, 9(@k) Px (xx) if discrete 


e Mean 
Equation: 
Wx =X 
e Second moment 
Equation: 
El X?|= xX? 
e Variance 


Equation: 


e Characteristic function 
Equation: 


®x(u) = em 


for u € R, where i = /—1 
Correlation between two random variables 
Equation: 


Ryy = RY 
| nen (ae ny fyy (a,y)dady if Xand Y are jointly continuous 


a CRY) pxy (tz, yi) if X and Y are jointly discrete 


e Covariance 
Equation: 


Cxy = Cov CX, y] 


= (X-ypx)(¥—py)" 


= Rxyy — uxby 
e Correlation coefficient 
Equation: 
Cov (X,Y) 
PxXY = 
OxXxOy 


Uncorrelated random variables 
Two random variables X and Y are uncorrelated if pxy = 0. 


Review of Linear Algebra 


Vector spaces are the principal object of study in linear algebra. A vector 
space is always defined with respect to a field of scalars. 


Fields 


A field is a set F’ equipped with two operations, addition and mulitplication, 
and containing two special members 0 and 1 (0 ~ 1), such that for all 
{a,b,cleF 


1 l(ea+beF 
2.a+b=b+a 
3.(a+b)+c=a+(b+c) 
4.a+0=a 
5. there exists —a such that a ++ —a = 0 


labe F 

2.ab = ba 

3. (ab)c = a (bc) 

AG. 6 

5. there exists a! such that aa~! = 1 


3.a(b+c)=ab+ac 
More concisely 
1. F is an abelian group under addition 


2. Fis an abelian group under multiplication 
3. multiplication distributes over addition 


Examples 


Q,R,C 


Vector Spaces 


Let F' be a field, and V a set. We say V is a vector space over F if there 
exist two operations, defined for alla € F,uw € V andv € V: 


e vector addition: (u,v) - (w+v) EV 
e scalar multiplication: (a,v) > av € V 


and if there exists an element denoted 0 € V, such that the following hold 
foralac F,b€ Fyanduwe V,veEV,andweVv 


1 lut+(v+w)=(ut+v)+w 
2.uU+v=vU+U 


3.u+0=u 
4. there exists —w such that w+ —u — 0 


More concisely, 


1. V is an abelian group under plus 
2. Natural properties of scalar multiplication 


Examples 


¢ RY” isa vector space over R 

¢ C% is a vector space over C 
Cer R 

° is a vector space over 

¢ RY” is not a vector space over C 


The elements of V are called vectors. 


Euclidean Space 


Throughout this course we will think of a signal as a vector 


The samples {;} could be samples from a finite duration, continuous time 
signal, for example. 


A signal will belong to one of two vector spaces: 


Real Euclidean space 


az € RY (over R) 


Complex Euclidean space 


x € CNX (over C) 


Subspaces 
Let V be a vector space over F’. 


A subset S C V is called a subspace of V if S is a vector space over F' in 
its own right. 


Example: 
V = R?, F=R, S = any line though the origin. 


Ay 


—> X 


A 


S is any line through the origin. 


Are there other subspaces? 


S C V is a subspace if and only if for all a € F and b € F and for all 
s€Sandt€S,(as+bt)eS 


Linear Independence 
Let u1,...,u, € V. 


We say that these vectors are linearly dependent if there exist scalars 
a1,...,@a,% € F such that 


Equation: 
k 
> aQa{,Uy = 0 
i=l 


and at least one a; # 0. 


If [link] only holds for the case ay = ... = az = O,7 we say that the vectors 
are linearly independent. 


Example: 
1 —2 =o 
UP ee Se i) 
2 0 —2 


so these vectors are linearly dependent in R°. 


Spanning Sets 


Consider the subset S = {v1,v2,...,v%}. Define the span of S 


wer 


k 
< S$ >=span(S) = {Soa 
i=l 


Fact: < S > is a subspace of V. 


Example: 


V=R? F-R, S= 101,00},07 = 10 |,0 = |) 1)/> 


< S$ >= xy-plane. 


< S$ > is the xy-plane. 


Aside 


If S is infinite, the notions of linear independence and span are easily 
generalized: 


We say S is linearly independent if, for every finite collection 
U1,---,UK € S, (k arbitrary) we have 


k 
i=1 


The span of Sis 


k 
a 15 = {oa 


dl 


acFAwES A ie<c)| 


Note:In both definitions, we only consider finite sums. 


Bases 


A set B C V is called a basis for V over F if and only if 


1. B is linearly independent 
2-<—b = 


Bases are of fundamental importance in signal processing. They allow us to 


decompose a signal into building blocks (basis vectors) that are often more 
easily understood. 


Example: 
V = (real or complex) Euclidean space, R® or CX. 


B= {ej,...,en} = standard basis 


where the 1 is in the i*" position. 


Example: 
V =CN% over C. 


which is the DFT basis. 


Uk = 


where 2 = VY —l. 


Key Fact 


If Bis a basis for V, then every v € V can be written uniquely (up to order 
of terms) in the form 


where a; € F' and v; € B. 


Other Facts 


e If Sis a linearly independent set, then S' can be extended to a basis. 
e If < S >= V, then S contains a basis. 


Dimension 


Let V be a vector space with basis B. The dimension of V, denoted 
dim (V), is the cardinality of B. 


Every vector space has a basis. 
Every basis for a vector space has the same cardinality. 
= dim (V) is well-defined. 


If dim (V) < oo, we say V is finite dimensional. 


Examples 
vector space field of scalars dimension 
RY R 
Cy C 
ce R 


Every subspace is a vector space, and therefore has its own dimension. 


Example: 
Suppose (S = {u1,...,ux}) C V isa linearly independent set. Then 


came S25) 


Facts 


e If S is a subspace of V, then dim (S) < dim(V). 
e Ifdim(S) = dim(V) < o, thn S=V. 


Direct Sums 
Let V be a vector space, and let S C V and T C V be subspaces. 


We say V is the direct sum of S and 7’, written V = S @ T, if and only if 
for every v © V, there exist unique s € S andt € TJ’ such that v = s+ f. 


If V = S @T, then T is called a complement of S. 


Example: 
V=C' ={f:R—RJfis continuous} 
S = even funcitons inC’ 


T = odd funcitons inC’ 
il i 


BA ee) na ae) Glaciog IN) lilt) 


heap Sevohes eso) Aa oe Seley Ge ee We sie ee ion 
g—g' =h' — his odd and even, which implies g = g’ andh = h’. 


Facts 


1. Every subspace has a complement 
2.V = S@T if and only if 


LSor= {ot 
2.4 5.1 S=V 


3.1fV = S@T, and dim (V) < ov, then 
dim (V) = dim (S$) + dim (T) 


Proofs 


Invoke a basis. 


Norms 


Let V be a vector space over F’. A norm is amapping V — F’, denoted by 
|| + ||, such that forallu €V,v Ee V,andAcCF 


1. || uw ||> Oifu ~0 
2. || Aw ||= |A| || e || 
3. [utes ull+| oll 


Examples 


Euclidean norms: 


2 eR: 


xecn: 


Induced Metric 
Every norm induces a metric on V 
d(u,v) =||u—v | 


which leads to a notion of "distance" between vectors. 


Inner products 


Let V be a vector space over F’, F = Ror C. An inner product is a 
mapping V x V -—> F, denoted (-, -), such that 


1. (v,v) > 0, and (v,v) =O Sv=0 
2. (U,V) = (v,u) 
3. (au + bv, w) = a ((u,w)) + b((v, w)) 


Examples 
RY” over R: 
N 
(x,y) c= (a7 y) = So iy: 
i=l 
CX over C: 


8 
I 


is called the "Hermitian," or "conjugate transpose" of a. 


Triangle Inequality 
If we define || w ||= (w, w), then 
wt v |[<l] ul] + |e | 


Hence, every inner product induces a norm. 


Cauchy-Schwarz Inequality 
ForallweV,ve V, 
(u,v)| <|[ w |] || v || 


In inner product spaces, we have a notion of the angle between two vectors: 


(tu v) = arceos( 7) |) € [0, 2m) 


| w | |e || 


Orthogonality 
uw and v are orthogonal if 

(u,v) =0 
Notation: wu L v. 


If in addition || w |/=|| v ||= 1, we say wu and v are orthonormal. 


In an orthogonal (orthonormal) set, each pair of vectors is orthogonal 
(orthonormal). 


Ay 


= X 


Orthogonal vectors in R?. 


Orthonormal Bases 
An Orthonormal basis is a basis {v; } such that 


lifi=j 


a 
Min Vi) = 949 ree 


Example: 
The standard basis for R% or C” 


Example: 
The normalized DFT basis 


Uk = 


ue 
VN 


Expansion Coefficients 


If the representation of v with respect to {v;} is 
v= » AYU; 
i 
then 
GH Uw) 


Gram-Schmidt 


Every inner product space has an orthonormal basis. Any (countable) basis 
can be made orthogonal by the Gram-Schmidt orthogonalization process. 


Orthogonal Compliments 


Let S C V bea subspace. The orthogonal compliment S is 


St={ulueV A ((u,v) =0) A Vu: (ve S)} 
S'+ is easily seen to be a subspace. 


If dim (v) < 00, then V= SQ S". 


Note:If dim (v) = 00, then in order to have V = S$ @ S+ we require V to 
be a Hilbert Space. 


Linear Transformations 


Loosely speaking, a linear transformation is a mapping from one vector 
space to another that preserves vector space operations. 


More precisely, let V, W be vector spaces over the same field F’. A linear 
transformation is a mapping 7’: V — W such that 


T(au + bv) = aT(u) + bT(v) 
foralac F,b€ Fanduc V,veEV. 


In this class we will be concerned with linear transformations between (real 
or complex) Euclidean spaces, or subspaces thereof. 


Image 


image (T) = {w| we W A T(v) = wfor somev} 


Nullspace 


Also known as the kernel: 


ker (T) = {vj v EV A (T(v) = 90)} 


Both the image and the nullspace are easily seen to be subspaces. 


Rank 


rank (T) = dim (image (T)) 


Nullity 


null (7) = dim (ker (T')) 


Rank plus nullity theorem 


rank (TJ) + null (7) = dim (V) 


Matrices 


Every linear transformation T' has a matrix representation. If 
T:EX ~ E”“,E=RorC, then T is represented by an M x N matrix 


Q@M1 --- QA@MN 


where (aj;...ayi)’ = T(e;) and e; = (0...1...0)" is the i** standard 
basis vector. 


Note:A linear transformation can be represented with respect to any bases 
of EN and E™, leading to a different A. We will always represent a linear 
transformation using the standard bases. 


Column span 


colspan (A) =< A >=image (A) 


Duality 


If A: RY > R™, then 


ker+(A) =image (A*) 


HA<:<C* —-C™ , then 


ker~ (A) =image (A®) 


Inverses 


The linear transformation/matrix A is invertible if and only if there exists a 
matrix B such that AB = BA = I (identity). 


Only square matrices can be invertible. 
Let A: FX — F% be linear, F = R or C. The following are equivalent: 
1. A is invertible (nonsingular) 


2.rank(A) = N 
3. null(A) = 0 


4.det A #0 
5. The columns of A form a basis. 


If A~' = A? (or A# in the complex case), we say A is orthogonal (or 
unitary). 


Signal Power 


An interesting question you could ask about a signal is its average power. A 
signal's instantaneous power is defined to be its square, as if it were a voltage 
or current passing through a 1 Q resistor. The average power is the average of 
the instantaneous power over some time interval. For a periodic signal, the 
natural time interval is clearly its period; for nonperiodic signals, a better 
choice would be entire time or time from onset. For a periodic signal, the 
average power is the square of the root-mean-squared (rms) value. We define 
the rms value of a periodic signal to be 

Equation: 


and thus its average power is rms?(s). 


Equation: 
power (s) = rms*(s) 
= 4 5 s%(t)dt 
Exercise: 


Problem: What is the rms value of the half-wave rectified sinusoid? 


Solution: 


A half-wave rectified sinusoid has half the average power of the original 
sine wave since it is zero half the time. A sine wave's average power 

2 
equals — making the rms value of the half-wave rectified signal — 


To find the average power in the frequency domain, we need to substitute the 
spectral representation of the signal into this expression. 
Equation: 


2 
1 fe? oe Qrkt O° Qrkt 

ower (s) = — a +5 a,cos —— + b; sin = ——— dt 
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The square inside the integral will contain all possible pairwise products. 
integrate to zero. The survivors leave a rather simple expression for the power 
we seek. 

Equation: 


1 CO 
power (s) = ag” + > ay” + by? 
k 1 


Power Spectrum of a Half-Wave Rectified Sinusoid 
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Power spectrum of a half-wave rectified 
sinusoid. 


It could well be that computing this sum is easier than integrating the signal's 
square. Furthermore, the contribution of each term in the Fourier series toward 
representing the signal can be measured by its contribution to the signal's 
average power. Thus, the power contained in a signal at its k th harmonic is 
ant tbe . The power spectrum P,(k), such as shown in [link], plots each 
harmonic's contribution to the total power. 


Exercise: 


Problem: 


In stereophonic systems, deviation of a sine wave from the ideal is 
measured by the total harmonic distortion, which equals the total power in 
the harmonics higher than the first compared to power in the fundamental. 
Find an expression for the total harmonic distortion for any periodic 
signal. Is this calculation most easily performed in the time or frequency 
domain? 


Solution: 


Fy Ae? +b,” 
a,2+b,? 
most easily computed in the frequency domain. However, the numerator 
equals the the square of the signal's rms value minus the power in the 
average and the power in the first harmonic. 


Total harmonic distortion equals . Clearly, this quantity is 


Signal Energy vs. Signal Power 


The idea of the "size" of a signal is crucial to many applications. It is nice to 
know how much electricity can be used in a defibrillator without ill effects, 
for instance. It is also nice to know if the signal driving a set of headpones 
is enough to create a sound. While both of these examples deal with electric 
signals, they are clearly very different signals with very different tolerances. 
For this reason, it is convenient to quantify this idea of "size". This leads to 
the ideas of signal energy and signal power. 


Signal Energy 


Since we often think of signal as a function of varying amplitude through 
time, it seems to reason that a good measurement of the strength of a signal 
would be the area under the curve. However, this area may have a negative 
part. This negative part does not have less strength than a positive signal of 
the same size (reversing your grip on the paper clip in the socket is not 
going to make you any more lively). This suggests either squaring the 
signal or taking its absolute value, then finding the area under that curve. It 
turns out that what we call the energy of a signal is the area under the 
squared signal. 


f(t) |f(t)|? 


The energy of this signal is the shaded region. 


Equation: 


Signal Power 


Our definition of energy seems reasonable, and it is. However, what if the 
signal does not decay? In this case we have infinite energy for any such 
signal. Does this mean that a sixty hertz sine wave feeding into your 
headphones is as strong as the sixty hertz sine wave coming out of your 
outlet? Obviously not. This is what leads us to the idea of signal power. 


A simple, common signal with infinite 
energy. 


Power is a time average of energy (energy per unit time). This is useful 
when the energy of the signal goes to infinity. 
Equation: 


T 
2 


1 2 
P; = limit — 
f = limit 7 |, eon dt 
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We compute the energy per a specific unit of time, then 
allow that time to go to infinity. 


Energy 
1. Compute —;>"- 


2. Then look at limit fnerey 
T- 00 


P, is often called the mean-square value of f. ,/ Py is then called the root 
mean squared (RMS) value of f. 


Energy vs. Power 


e "Energy signals" have finite energy. 
¢ "Power signals" have finite and non-zero power. 


Exercise: 


Problem: Are all energy signals also power signals? 


Solution: 


No. In fact, any signal with finite energy will have zero power. 


Exercise: 


Problem: Are all power signals also energy signals? 


Solution: 


No, any signal with non-zero power will have infinite energy. 


Exercise: 


Problem: Are all signals either energy or power signals? 
Solution: 


No. Any infinite-duration, increasing-magnitude function will not be 
either. (eg f(t) = t is neither) 


Introduction to Statistical Signal Processing 


Digital Signal Processing 


e Digital = sampled, discrete-time, quantized 
e Signal = waveform, sequnce of measurements or observations 
e Processing = analyze, modify, filter, synthesize 


Examples of Digital Signals 


e sampled speech waveform 
e "pixelized" image 
¢ Dow-Jones Index 


DSP Applications 


e Filtering (noise reduction) 
e Pattern recognition (speech, faces, fingerprints) 
e Compression 


A Major Difficulty 


In many (perhaps most) DSP applications we don't have complete or perfect 
knowledge of the signals we wish to process. We are faced with many 
unknowns and uncertainties. 


Examples 


e noisy Measurements 

e unknown signal parameters 

e noisy system or environmental conditions 

e natural variability in the signals encountered 


Functional Magnetic Resonance Imaging 
[missing resource: FMRI.png] 


Challenges are 
measurement noise and 
intrinsic uncertainties in 

signal behavior. 


How can we design signal processing algorithms in the face of such 
uncertainty? 


Can we model the uncertainty and incorporate this model into the design 
process? 


Statistical signal processing is the study of these questions. 


Modeling Uncertainty 


The most widely accepted and commonly used approach to modeling 
uncertainty is Probability Theory (although other alternatives exist such as 
Fuzzy Logic). 


Probability Theory models uncertainty by specifying the chance of 
observing certain signals. 


Alternatively, one can view probability as specifying the degree to which 
we believe a signal reflects the true state of nature. 


Examples of Probabilistic Models 


¢ errors in a measurement (due to an imprecise measuring device) 
modeled as realizations of a Gaussian random variable. 

¢ uncertainty in the phase of a sinusoidal signal modeled as a uniform 
random variable on [0, 27). 

¢ uncertainty in the number of photons stiking a CCD per unit time 
modeled as a Poisson random variable. 


Statistical Inference 


A Statistic is a function of observed data. 


Example: 
Suppose we observe WN scalar values 21,..., yy. The following are 
Statistics: 


er=4 S~_, &n (sample mean) 
° 21,-..,Xy (the data itself) 

e min {z,,..., 2} (order statistic) 
° (x17 — £2 sin(x3), e '"!*)) 


A Statistic cannot depend on unknown parameters. 


Probability is used to model uncertainty. 


Statistics are used to draw conclusions about probability models. 


probability 


Population 
of measured 
“random” signal 
signals 


statistics 


Probability models our uncertainty about signals we may observe. 


Statistics reasons from the measured signal to the population of possible 
signals. 


Statistical Signal Processing 
e Step 1 Postulate a probability model (or models) that reasonably 
capture the uncertainties at hand 
e Step 2 Collect data 


e Step 3 Formulate statistics that allow us to interpret or understand our 
probability model(s) 


In this class 


The two major kinds of problems that we will study are detection and 
estimation. Most SSP problems fall under one of these two headings. 


Detection Theory 


Given two (or more) probability models, which on best explains the signal? 


Examples 


1. Decode wireless comm signal into string of 0's and 1's 
2. Pattern recognition 


© voice recognition 
o face recognition 
o handwritten character recognition 


3. Anomaly detection 


© radar, sonar 
© irregular, heartbeat 
© gamma-ray burst in deep space 


Estimation Theory 


If our probability model has free parameters, what are the best parameter 
settings to describe the signal we've observed? 


Examples 


1. Noise reduction 
2. Determine parameters of a sinusoid (phase, amplitude, frequency) 
3. Adaptive filtering 


o track trajectories of space-craft 
© automatic control systems 


o channel equalization 


4. Determine location of a submarine (sonar) 
5. Seismology: estimate depth below ground of an oil deposit 


Example: 


Detection Example 


Suppose we observe WN tosses of an unfair coin. We would like to decide 
which side the coin favors, heads or tails. 


e Step 1 Assume each toss is a realization of a Bernoulli random 
variable. 


Pr[Heads] = p = 1 — Pr[Tails| 
Must decide p = 4 Sa — a 
e Step 2 Collect data 71,...,2y 
x; = 1 = Heads 
x; —0= Tails 


e Step 3 Formulate a useful statistic 


N 
by er 


= 


lings Be 2, guess p = +.1fk > + guess p = 4. 


Example: 
Estimation Example 


Suppose we take N measurements of a DC voltage A with a noisy 
voltmeter. We would like to estimate A. 


e Step 1 Assume a Gaussian noise model 


L, =At+wu, 


where w, ~ (0,1). 
e Step 2 Gather data 71,...,2y 
e Step 3 Compute the sample mean, 


and use this as an estimate. 


In these examples (({link] and [link]), we solved detection and estimation 
problems using intuition and heuristics (in Step 3). 


This course will focus on developing principled and mathematically 
rigorous approaches to detection and estimation, using the theoretical 
framework of probability and statistics. 


Summary 


e DSP = processing signals with computer algorithms. 
e SSP = statistical DSP = processing in the presence of uncertainties and 
unknowns. 


Introduction to Random Signals and Processes 


Before now, you have probably dealt strictly with the theory behind signals 
and systems, as well as look at some the basic characteristics of signals and 
systems. In doing so you have developed an important foundation; however, 
most electrical engineers do not get to work in this type of fantasy world. In 
many cases the signals of interest are very complex due to the randomness 
of the world around them, which leaves them noisy and often corrupted. 
This often causes the information contained in the signal to be hidden and 
distorted. For this reason, it is important to understand these random signals 
and how to recover the necessary information. 


Signals: Deterministic vs. Stochastic 


For this study of signals and systems, we will divide signals into two 
groups: those that have a fixed behavior and those that change randomly. As 
most of you have probably already dealt with the first type, we will focus 
on introducing you to random signals. Also, note that we will be dealing 
strictly with discrete-time signals since they are the signals we deal with in 
DSP and most real-world computations, but these same ideas apply to 
continuous-time signals. 


Deterministic Signals 


Most introductions to signals and systems deal strictly with deterministic 
signals. Each value of these signals are fixed and can be determined by a 
mathematical expression, rule, or table. Because of this, future values of 
any deterministic signal can be calculated from past values. For this reason, 
these signals are relatively easy to analyze as they do not change, and we 
can make accurate assumptions about their past and future behavior. 
Deterministic Signal 


An example of a deterministic signal, the sine wave. 


Stochastic Signals 


Unlike deterministic signals, stochastic signals, or random signals, are not 
so nice. Random signals cannot be characterized by a simple, well-defined 
mathematical equation and their future values cannot be predicted. Rather, 
we must use probability and statistics to analyze their behavior. Also, 
because of their randomness, average values from a collection of signals are 
usually studied rather than analyzing one individual signal. 
Random Signal 
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0 filing, “liaios oA Ray at yy ; 


abl Lae? dl 


We have taken the above sine wave and added random 
noise to it to come up with a noisy, or random, signal. 
These are the types of signals that we wish to learn how 
to deal with so that we can recover the original sine 
wave. 


Random Process 


As mentioned above, in order to study random signals, we want to look at a 
collection of these signals rather than just one instance of that signal. This 
collection of signals is called a random process. 


random process 


A family or ensemble of signals that correspond to every possible 
outcome of a certain signal measurement. Each signal in this collection 
is referred to as a realization or sample function of the process. 


Example: 

As an example of a random process, let us look at the Random Sinusoidal 
Process below. We use f[n] = Asin(wn + y) to represent the sinusoid 
with a given amplitude and phase. Note that the phase and amplitude of 
each sinusoid is based on a random number, thus making this a random 
process. 


Random Sinusoidal Process 
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A random sinusoidal process, with the amplitude and 
phase being random numbers. 


A random process is usually denoted by X(t) or X[n], with x(t) or x[n] 
used to represent an individual signal or waveform from this process. 


In many notes and books, you might see the following notation and terms 
used to describe different types of random processes. For a discrete 
random process, sometimes just called a random sequence, ¢ represents 
time that has a finite number of values. If ¢ can take on any value of time, 
we have a continuous random process. Often times discrete and 
continuous refer to the amplitude of the process, and process or sequence 
refer to the nature of the time variable. For this study, we often just use 
random process to refer to a general collection of discrete-time signals, as 
seen above in [link]. 


Introduction to Stochastic Processes 


Definitions, distributions, and stationarity 


Stochastic Process 
Given a sample space, a stochastic process is an indexed collection of random variables defined for each 
we 92. 
Equation: 


Vt,t ER: (X;(w)) 


Example: 
Received signal at an antenna as in [link]. 
Sample Paths 


For a given t, X;(w) is a random variable with a distribution 
Equation: 
First-order distribution 


Fx,(b) = Pr[X; < } 
Pri{w € 2| Xz(w) < b}] 


First-order stationary process 
If F'x,(0) is not a function of time then X; is called a first-order stationary process. 


Equation: 
Second-order distribution 


Ee ae (b1, ba) = Pr[ Xz, < by, Xt, < bo] 


forallt; EC R,t €R,b, CR, bo ECR 
Equation: 


Nth-order distribution 


Pe este (ig Dds ed ., by) = Pr[ Xz, < bi, ene »» Xty < by] 


Nth-order stationary : A random process is stationary of order N if 
Equation: 


Fi Ky son Mee (OT) bo,...,bnw) = ba eae et eee CNA Uae bo,..., bn) 


Strictly stationary : A process is strictly stationary if it is Nth order stationary for all NV. 


Example: 
X, = cos(2afot + O(w)) where fo is the deterministic carrier frequency and O(w) : 2 — Ris arandom 
variable defined over |—7,, 7] and is assumed to be a uniform random variable; i.e., 


Sit 
fo(0) = { 21 [ | 
0 otherwise 
Equation: 
Fx,(b) = Pr[X; < 5] 
= Pricos(2rfot + O) < d 
Equation: 
F,(b) = Pr[—a < 2fot + O < —arccos(b)| + Priarccos(b) < 2afot + O < 7] 
Equation: 
ba —arccos(b))—27fot m—2rfot 
Fx,(0) = Apes Se ; = piel OS eea(a) —2rfot a dé 
= (2r-2 Oa x 
Equation: 
a= a(t — = arccos(z)) 
_ a Helcot 


0 otherwise 


This process is stationary of order 1. 


Plots of Cosines with different Phases and the same frequency 
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The second order stationarity can be determined by first considering conditional densities and the joint 
density. Recall that 
Equation: 


X; = cos(2rfot + O) 


Then the relevant step is to find 


Equation: 
Pree ab) | — 
Note that 
Equation: 
(Xi, = x1 = cos(2rfot + O)) > (O = arccos(x1) — 27 fot) 
Equation: 


Xt, = cos(2rfot2 + arccos(x1) — 2a fot1) 
cos(2m fo (t2 — t1) + arccos(x1)) 


Pr[X,,<b,| X,=x, ] 


cos(2Th(t, - ty) + cos X,) - 


Equation: 


by 
Fx,,,x,, (b2, b1) = fx,,(@1) Pr[Xt, < be | Xt, = 21] d x1 


Note that this is only a function of t2 — f. 


Example: 
Every T seconds, a fair coin is tossed. If heads, then X; = 1 fornT <t < (n+ 1)T. If tails, then 
X, = —lfornT <t < (n+1)T. 


X, f Sample function 


Equation: 


N|F wR 


Px (2) = 


for allt € R. X; is stationary of order 1. 
Second order probability mass function 
Equation: 


PX,,X.,(€1, ©2) = Px,,|x,, (€2]@1)Px,, (#1) 


The conditional pmf 
Equation: 
0 if L2 z= Ly 


PX,,|x,, 2101) a : if t2= 21 


when nT < t) < (n+ 1)T and nT < tz < (n+ 1)T for some n. 
Equation: 


Px,,|x,,(@2|@1) = Px., (#2) 


for all x; and for all 2 when nT < t; < (n+1)T andmT < tz < (m+1)T withn#m 
Equation: 
0 if r2 A axjfor nT < ti, te < (n+1)T 
eve ceo px, (t1) if a2 = afornT < ti, te < (n+1)T 
Px, (£1)px,,(t2) if n A mfor (nT < t) < (n+1)T) A (mT < ty < (m+1)T) 


The Gaussian Process 


A random process X(t) is Gaussian if the joint density of the N amplitudes 
{X(t1),...,, X(t) } comprise a Gaussian random vector. The elements of 
the required covariance matrix equal the covariance between the 
appropriate amplitudes: K;,; = Kx(t:,t;). Assuming the mean is known, 
the entire structure of the Gaussian random process is specified once the 
correlation function or, equivalently, the power spectrum is known. As 
linear transformations of Gaussian random processes yield another 
Gaussian process, linear operations such as differentiation, integration, 
linear filtering, sampling, and summation with other Gaussian processes 
result in a Gaussian process. 


Sampling and Random Sequences 


The usual Sampling Theorem applies to random processes, with the 
spectrum of interest beign the power spectrum. If stationary process X(t) is 
bandlimited - Ax(w) = 0, |w| > W, as long as the sampling interval T 
satisfies the classic constraint T < 7; the sequence X(/T) represents the 


original process. A sampled process is itself a random process defined over 
discrete time. Hence, all of the random process notions introduced in the 


previous section apply to the random sequence x (1) = X(IT). The 
correlation functions of these two processes are related as 


R;(k) = B| X()X( " k) = Rx(kT) 


We note especially that for distinct samples of a random process to be 
uncorrelated, the correlation function Rx(kT’) must equal zero for all non- 
zero k. This requirement places severe restrictions on the correlation 
function (hence the power spectrum) of the original process. One 
correlation function satisfying this property is derived from the random 
process which has a bandlimited, constant-valued power spectrum over 
precisely the frequency region needed to satisfy the sampling criterion. No 
other power spectrum satisfying the sampling criterion has this 
property. Hence, sampling does not normally yield uncorrelated 
amplitudes, meaning that discrete-time white noise is a rarity. White noise 
has a correlation function given by Rz(k) = o76(k), where 6(-) is the unit 


sample. The power spectrum of white noise is a constant: S(w) =e". 


Stationary and Nonstationary Random Processes 


Introduction 


From the definition of a random process, we know that all random 
processes are composed of random variables, each at its own unique point 
in time. Because of this, random processes have all the properties of 
random variables, such as mean, correlation, variances, etc.. When dealing 
with groups of signals or sequences it will be important for us to be able to 
show whether of not these statistical properties hold true for the entire 
random process. To do this, the concept of stationary processes has been 
developed. The general definition of a stationary process is: 


stationary process 
a random process where all of its statistical properties do not vary with 
time 


Processes whose statistical properties do change are referred to as 
nonstationary. 


Understanding the basic idea of stationarity will help you to be able to 
follow the more concrete and mathematical definition to follow. Also, we 
will look at various levels of stationarity used to describe the various types 
of stationarity characteristics a random process can have. 


Distribution and Density Functions 


In order to properly define what it means to be stationary from a 
mathematical standpoint, one needs to be somewhat familiar with the 
concepts of distribution and density functions. If you can remember your 
Statistics then feel free to skip this section! 


Recall that when dealing with a single random variable, the probability 
distribution function is a simply tool used to identify the probability that 
our observed random variable will be less than or equal to a given number. 
More precisely, let X be our random variable, and let z be our given value; 
from this we can define the distribution function as 


Equation: 


Fi(¢) = Prix <2] 


This same idea can be applied to instances where we have multiple random 
variables as well. There may be situations where we want to look at the 
probability of event X and Y both occurring. For example, below is an 
example of a second-order joint distribution function. 

Equation: 


F,(¢,4) =Pr|xX <2, Y <4 


While the distribution function provides us with a full view of our variable 
or processes probability, it is not always the most useful for calculations. 
Often times we will want to look at its derivative, the probability density 
function (pdf). We define the the pdf as 

Equation: 


Equation: 


fr(x) dx = Prien < X < a+ dx] 


[link] reveals some of the physical significance of the density function. This 
equations tells us the probability that our random variable falls within a 
given interval can be approximated by f,(2) dx. From the pdf, we can now 
use our knowledge of integrals to evaluate probabilities from the above 
approximation. Again we can also define a joint density function which 
will include multiple random variables just as was done for the distribution 
function. The density function is used for a variety of calculations, such as 
finding the expected value or proving a random variable is stationary, to 
name a few. 


Note: The above examples explain the distribution and density functions in 
terms of a single random variable, X. When we are dealing with signals 
and random processes, remember that we will have a set of random 
variables where a different random variable will occur at each time 
instance of the random process, X(¢,). In other words, the distribution and 
density function will also need to take into account the choice of time. 


Stationarity 


Below we will now look at a more in depth and mathematical definition of 
a stationary process. As was mentioned previously, various levels of 
stationarity exist and we will look at the most common types. 


First-Order Stationary Process 


A random process is classified as first-order stationary if its first-order 
probability density function remains equal regardless of any shift in time to 
its time origin. If we let x;, represent a given value at time ¢1, then we 
define a first-order stationary as one that satisfies the following equation: 
Equation: 


fe(Xt,) rs fe(2t,47) 


The physical significance of this equation is that our density function, 
fe(xz,), is completely independent of ¢; and thus any time shift, 7. 


The most important result of this statement, and the identifying 
characteristic of any first-order stationary process, is the fact that the mean 
is a constant, independent of any time shift. Below we show the results for a 
random process, X, that is a discrete-time signal, x[n]. 

Equation: 


X = m,|n| 
E\z|n]] 
constant (independent of n) 


| 


| 


Second-Order and Strict-Sense Stationary Process 


A random process is classified as second-order stationary if its second- 
order probability density function does not vary over any time shift applied 
to both values. In other words, for values x;, and x;, then we will have the 
following be equal for an arbitrary time shift T. 

Equation: 


f | Cts Lty) =7, Aen Leia) 


From this equation we see that the absolute time does not affect our 
functions, rather it only really depends on the time difference between the 
two variables. Looked at another way, this equation can be described as 
Equation: 


Pr[X(t1) < 21, X(t2) < ve] = Pr[X(t, +7) < 21, X(t2 +7) < a9] 


These random processes are often referred to as strict sense stationary 
(SSS) when all of the distribution functions of the process are unchanged 
regardless of the time shift applied to them. 


For a second-order stationary process, we need to look at the 
autocorrelation function to see its most important property. Since we have 
already stated that a second-order stationary process depends only on the 
time difference, then all of these types of processes have the following 
property: 

Equation: 


Reo(t,t +7) 
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Wide-Sense Stationary Process 


As you begin to work with random processes, it will become evident that 
the strict requirements of a SSS process is more than is often necessary in 
order to adequately approximate our calculations on random processes. We 
define a final type of stationarity, referred to as wide-sense stationary 
(WSS), to have slightly more relaxed requirements but ones that are still 
enough to provide us with adequate results. In order to be WSS a random 
process only needs to meet the following two requirements. 


1. X = E/z|n]] = constant 
2. BE X(tbr)| =] As(7) 


Note that a second-order (or SSS) stationary process will always be WSS; 
however, the reverse will not always hold true. 


Correlation and Covariance of a Random Signal 


When we take the expected value, or average, of a random process, we 
measure several important characteristics about how the process behaves in 
general. This proves to be a very important observation. However, suppose 
we have several random processes measuring different aspects of a system. 
The relationship between these different processes will also be an important 
observation. The covariance and correlation are two important tools in 
finding these relationships. Below we will go into more details as to what 
these words mean and how these tools are helpful. Note that much of the 
following discussions refer to just random variables, but keep in mind that 
these variables can represent random signals or random processes. 


Covariance 


To begin with, when dealing with more than one random process, it should 
be obvious that it would be nice to be able to have a number that could 
quickly give us an idea of how similar the processes are. To do this, we use 
the covariance, which is analogous to the variance of a single variable. 


Covariance 
A measure of how much the deviations of two or more variables or 
processes match. 


For two processes, X and Y, if they are not closely related then the 
covariance will be small, and if they are similar then the covariance will be 
large. Let us clarify this statement by describing what we mean by "related" 
and "similar." Two processes are "closely related" if their distribution 
spreads are almost equal and they are around the same, or a very slightly 
different, mean. 


Mathematically, covariance is often written as o,, and is defined as 
Equation: 


COv(A;Y) = Oxy 


Ale) 


This can also be reduced and rewritten in the following two forms: 
Equation: 


Ory = (xy) _ xy 


Equation: 


oy= ff (x-x] (¥-¥) feu) dedy 


Useful Properties 


e If X and Y are independent and uncorrelated or one of them has zero 
mean value, then 


Ory = 0 


If X and Y are orthogonal, then 
Ory = — (E[XJE[Y]) 
e The covariance is symmetric 
cov (X,Y) = cov(Y, X) 
e Basic covariance identity 
cov (X + Y, Z) = cov(X, Z) + cov (Y, Z) 
e Covariance of equal variables 


cov (X, X) = Var (X) 


Correlation 


For anyone who has any kind of statistical background, you should be able 
to see that the idea of dependence/independence among variables and 
signals plays an important role when dealing with random processes. 
Because of this, the correlation of two variables provides us with a 
measure of how the two variables affect one another. 


Correlation 
A measure of how much one random variable depends upon the other. 


This measure of association between the variables will provide us with a 
clue as to how well the value of one variable can be predicted from the 
value of the other. The correlation is equal to the average of the product of 
two random variables and is defined as 

Equation: 


cor(X,Y) = E|XxY] 
= [% fi cyf(a,y)dady 


Correlation Coefficient 


It is often useful to express the correlation of random variables with a range 
of numbers, like a percentage. For a given set of variables, we use the 
correlation coefficient to give us the linear relationship between our 
variables. The correlation coefficient of two variables is defined in terms of 
their covariance and standard deviations, denoted by az, as seen below 
Equation: 


_ cov(X,Y) 


O70 y 


where we will always have 


=e ps1 


This provides us with a quick and easy way to view the correlation between 
our variables. If there is no relationship between the variables then the 
correlation coefficient will be zero and if there is a perfect positive match it 
will be one. If there is a perfect inverse relationship, where one set of 
variables increases while the other decreases, then the correlation 
coefficient will be negative one. This type of correlation is often referred to 
more specifically as the Pearson's Correlation Coefficient,or Pearson's 
Product Moment Correlation. 


Positive Negative Uncorrelated (No 
Correlation Correlation Correlation) 
variable 2 variable 2 variable 2 


variable 1] variable 1 


Variable ] 


Types of Correlation 


Note: So far we have dealt with correlation simply as a number relating the 
relationship between any two variables. However, since our goal will be to 
relate random processes to each other, which deals with signals as a 
function of time, we will want to continue this study by looking at 
correlation functions. 


Example 


Now let us take just a second to look at a simple example that involves 
calculating the covariance and correlation of two sets of random numbers. 
We are given the following data sets: 


x = {3,1,6,3,4} 
y = {1,5, 3, 4, 3} 


To begin with, for the covariance we will need to find the expected value, or 
mean, of x and y. 


e= —(3+1+6+3+4)=3.4 


y= —(14+54+344+4+3) =3.2 
1 
ry = ~ (3 +5+18+12+ 12) = 10 


Next we will solve for the standard deviations of our two sets using the 
formula below (for a review click here). 


o = z[x 7 E|X])” 
i 
o> = [= (0.16 + 5.76 + 6.76 + 0.16 + 0.36) = 1.625 


1 
oy = \/ g (4.84 + 3.24 + 0.04 + 0.64 + 0.04) = 1.327 


Now we can finally calculate the covariance using one of the two formulas 
found above. Since we calculated the three means, we will use that formula 
since it will be much simpler. 


Ory = 10 — 3.4 x 3.2 = —0.88 


And for our last calculation, we will solve for the correlation coefficient, p. 


—0.88 


= ———..~ = — 0.408 
1.625: x 1.327 


p 


Matlab Code for Example 


The above example can be easily calculated using Matlab. Below I have 
included the code to find all of the values above. 


x = [3 1 6 3 4]; 
y= [15 3 4 3]; 
mx = mean(x) 
my = mean(y) 
mxy = mean(x. *y) 


% Standard Dev. from built-in Matlab 
Functions 

std(x,1) 

std(y,1) 


% Standard Dev. from Equation Above 
(same result as std(?,1)) 

sqrt( 1/5 * sum((x-mx).42)) 

sqrt( 1/5 * sum((y-my).42)) 

cov(x,y,1) 


corrcoef(x,y) 


Autocorrelation of Random Processes 


Before diving into a more complex statistical analysis of random signals 
and processes, let us quickly review the idea of correlation. Recall that the 
correlation of two signals or variables is the expected value of the product 
of those two variables. Since our focus will be to discover more about a 
random process, a collection of random signals, then imagine us dealing 
with two samples of a random process, where each sample is taken at a 
different point in time. Also recall that the key property of these random 
processes is that they are now functions of time; imagine them as a 
collection of signals. The expected value of the product of these two 
variables (or samples) will now depend on how quickly they change in 
regards to time. For example, if the two variables are taken from almost the 
same time period, then we should expect them to have a high correlation. 
We will now look at a correlation function that relates a pair of random 
variables from the same process to the time separations between them, 
where the argument to this correlation function will be the time difference. 
For the correlation of signals from two different random process, look at the 
crosscorrelation function. 


Autocorrelation Function 


The first of these correlation functions we will discuss is the 
autocorrelation, where each of the random variables we will deal with 
come from the same random process. 


Autocorrelation 
the expected value of the product of a random variable or signal 
realization with a time-shifted version of itself 


With a simple calculation and analysis of the autocorrelation function, we 
can discover a few important characteristics about our random process. 
These include: 


1. How quickly our random signal or processes changes with respect to 
the time function 


2. Whether our process has a periodic component and what the expected 
frequency might be 


As was mentioned above, the autocorrelation function is simply the 
expected value of a product. Assume we have a pair of random variables 
from the same process, X; = X(t) and X2 = X(t2), then the 
autocorrelation is often written as 

Equation: 


Rx(t1, t2) E|X1X9| 


= fo fe eieef (a1, 22) da day 


The above equation is valid for stationary and nonstationary random 
processes. For stationary processes, we can generalize this expression a 
little further. Given a wide-sense stationary processes, it can be proven that 
the expected values from our random process will be independent of the 
origin of our time function. Therefore, we can say that our autocorrelation 
function will depend on the time difference and not some absolute time. For 
this discussion, we will let 7 = t2 — t;, and thus we generalize our 
autocorrelation expression as 

Equation: 


Rax (t,t +7) = Rex(7) 
= E[X(t)X(t+7)| 


for the continuous-time case. In most DSP course we will be more 
interested in dealing with real signal sequences, and thus we will want to 
look at the discrete-time case of the autocorrelation function. The formula 
below will prove to be more common and useful than [link]: 

Equation: 


And again we can generalize the notation for our autocorrelation function as 
Equation: 


Ryex[n,n+m) = Ry[m] 
= E|X[n|X[n+ ml] 


Properties of Autocorrelation 


Below we will look at several properties of the autocorrelation function that 
hold for stationary random processes. 


e Autocorrelation is an even function for T 


e The mean-square value can be found by evaluating the autocorrelation 
where 7 = O, which gives us 


Ryx(0) = X? 


e The autocorrelation function will have its largest value when 7 = 0. 
This value can appear again, for example in a periodic function at the 
values of the equivalent periodic points, but will never be exceeded. 


Rxx(0) 2 |Rxx(7)| 


e If we take the autocorrelation of a period function, then R,x(7) will 
also be periodic with the same frequency. 


Estimating the Autocorrleation with Time-Averaging 


Sometimes the whole random process is not available to us. In these cases, 
we would still like to be able to find out some of the characteristics of the 


stationary random process, even if we just have part of one sample function. 
In order to do this we can estimate the autocorrelation from a given 
interval, 0 to 7’ seconds, of the sample function. 

Equation: 


However, a lot of times we will not have sufficient information to build a 
complete continuous-time function of one of our random signals for the 
above analysis. If this is the case, we can treat the information we do know 
about the function as a discrete signal and use the discrete-time formula for 
estimating the autocorrelation. 


Equation: 
1 N-m-1 
<n) = = Qe z|n|z[n + m] 
Examples 


Below we will look at a variety of examples that use the autocorrelation 
function. We will begin with a simple example dealing with Gaussian White 
Noise (GWN) and a few basic statistical properties that will prove very 
useful in these and future calculations. 


Example: 
We will let x|n] represent our GWN. For this problem, it is important to 
remember the following fact about the mean of a GWN function: 


E|z|n|| = 0 


x[n] 


Gaussian density 
function. By 
examination, can 
easily see that 
the above 
statement is true 
- the mean equals 
Zero. 


Along with being zero-mean, recall that GWN is always independent. 
With these two facts, we are now ready to do the short calculations 
required to find the autocorrelation. 


Ryx|[n, n+ m] = Ela[n|a2[n + m]] 


Since the function, x|n], is independent, then we can take the product of 
the individual expected values of both functions. 


Ryx[n,n +m] = Elz|[n||Elae[n + ml] 


Now, looking at the above equation we see that we can break it up further 
into two conditions: one when m and n are equal and one when they are 
not equal. When they are equal we can combine the expected values. We 
are left with the following piecewise function to solve: 


2 Elz([n||E|z[n + mj] if m 40 
xx[n, 2 + m] — E|zx?[n]| if 970 

We can now solve the two parts of the above equation. The first equation is 

easy to solve as we have already stated that the expected value of x[n] will 

be zero. For the second part, you should recall from statistics that the 


expected value of the square of a function is equal to the variance. Thus we 
get the following results for the autocorrelation: 


Unit i) 


R,|n,n +m] = 
| | oo ine 


Or in a more concise way, we can represent the results as 


eit. es m| ao oa? 5[m] 


Crosscotrelation of Random Processes 


Before diving into a more complex statistical analysis of random signals 
and processes, let us quickly review the idea of correlation. Recall that the 
correlation of two signals or variables is the expected value of the product 
of those two variables. Since our main focus is to discover more about 
random processes, a collection of random signals, we will deal with two 
random processes in this discussion, where in this case we will deal with 
samples from two different random processes. We will analyze the 
expected value of the product of these two variables and how they correlate 
to one another, where the argument to this correlation function will be the 
time difference. For the correlation of signals from the same random 
process, look at the autocorrelation function. 


Crosscorrelation Function 


When dealing with multiple random processes, it is also important to be 
able to describe the relationship, if any, between the processes. For 
example, this may occur if more than one random signal is applied to a 
system. In order to do this, we use the crosscorrelation function, where the 
variables are instances from two different wide sense stationary random 
processes. 


Crosscorrelation 
if two processes are wide sense stationary, the expected value of the 
product of a random variable from one random process with a time- 
shifted, random variable from a different random process 


Looking at the generalized formula for the crosscorrelation, we will 
represent our two random processes by allowing U = U(t) and 

V = V(t — 7). We will define the crosscorrelation function as 
Equation: 


Rw(t,t—7) = EUV] 
ff urf(u,v) dvdu 


Just as the case with the autocorrelation function, if our input and output, 
denoted as U(t) and V(t), are at least jointly wide sense stationary, then the 
crosscorrelation does not depend on absolute time; it is just a function of the 
time difference. This means we can simplify our writing of the above 
function as 

Equation: 


Ruy(7) = E[UV] 


or if we deal with two real signal sequences, x{n] and y|n], then we arrive 
at a more commonly seen formula for the discrete crosscorrelation function. 
See the formula below and notice the similarities between it and the 
convolution of two signals: 

Equation: 


Rey(njn—m) = Rey(m) 
= DS welnlyin— ml 


Properties of Crosscorrelation 


Below we will look at several properties of the crosscorrelation function 
that hold for two wide sense stationary (WSS) random processes. 


e Crosscorrelation is not an even function; however, it does have a 
unique symmetry property: 
Equation: 


Ray(—T) = Rye(7) 


e The maximum value of the crosscorrelation is not always when the 
shift equals zero; however, we can prove the following property 
revealing to us what value the maximum cannot exceed. 
Equation: 


IRry(7)| < 1/ Rez (0) Ryy(0) 


e When two random processes are statistically independent then we have 
Equation: 


Ray(T) = Rye(7) 


Examples 


Exercise: 


Problem: 


Let us begin by looking at a simple example showing the relationship 
between two sequences. Using [link], find the crosscorrelation of the 
sequences 


ge) = 4.25 0, 0,2, —3,6,.153,0, 0402} 
y[n] = {...,0,0, 1, —2, 4, 1, —3,0,0,...} 
for each of the following possible time shifts: m = {0, 3, —1}. 


Solution: 


1. For m = 0, we should begin by finding the product sequence 
s[n] = x[n]y|[n]. Doing this we get the following sequence: 


sin] = {...,0,0,2, 6, 24, 1, —9,0,0,...} 


and so from the sum in our crosscotrrelation function we arrive at 
the answer of 


R,,(0) = 22 


2. For m = 3, we will approach it the same was we did above; 
however, we will now shift y|n] to the right. Then we can find the 


product sequence s{n] = x|[n]|y|n — 3], which yields 
s|n] = {...,0,0,0,0,0,1, —6,0,0,...} 
and from the crosscorrelation function we atrive at the answer of 
RaQ) = -6 


3. Form = —1, we will again take the same approach; however, we 
will now shift y[7] to the left. Then we can find the product 
sequence s(n] = z|n]y|n + 1], which yields 


s|n] = 1.x iec0; —A4, —12, 6, —3, 0,0,0,...} 
and from the crosscorrelation function we arrive at the answer of 


Ae 13 


Cauchy-Schwarz Inequality 
This module provides both statement and proof of the Cauchy-Schwarz inequality 
and discusses its practical implications with regard to the matched filter detector. 


Introduction 


Any treatment of linear algebra as relates to signal processing would not be 
complete without a discussion of the Cauchy-Schwarz ineqaulity, a relation that 
enables a wide array of signal procesing applications related to pattern matching 
through a method called the matched filter. Recall that in standard Euclidean 
space, the angle 0 between two vectors x, y is given by 

Equation: 


(a, y) 
Ce Talila: 


Since cos (8) < 1, it follows that 
Equation: 


(x,y) |? < (x, a) ty, y). 


Furthermore, equality holds if and only if cos (#9) = 0, implying that 
Equation: 


I(x, y)|" = (a, 2) (y, y) 


if and only if y = az for some real a. This relation can be extended to all inner 
product spaces over a real or complex field and is known as the Cauchy-Schwarz 
inequality, which is of great importance to the study of signals. 


The Cauchy-Schwarz Inequality 


Statement of the Cauchy-Schwarz Inequality 


The general statement of the Cauchy-Schwarz inequality mirrors the intuition for 
standard Euclidean space. Let V be an inner product space over the field of 


complex numbers C with inner product (-, -). For every pair of vectors x,y € V 
the inequality 
Equation: 


(x, y)|" < (a, 2) (yy) 


holds. Furthermore, the equality 
Equation: 


I(x, y)|" = (a, 2) (y, y) 


holds if and only if y = ax for some a € C. That is, equality holds if and only if 
x and y are linearly dependent. 


Proof of the Cauchy-Schwarz Inequality 


Let V be a vector space over the real or complex field F’, and let x, y € V be 
given. In order to prove the Cauchy-Schwarz inequality, it will first be proven that 


(x, y)|° = (x, x) (y, y) if y = ax for some a € F. It will then be shown that 
(x,y) |? < (x, x) (y, y) if y 4 az foralla € F. 


Consider the case in which y = az for some a € F’. From the properties of inner 
products, it is clear that 
Equation: 


x,y)!’ = |(x, az)|? 


=|a(z,2)|?, 


Hence, it follows that 
Equation: 


Similarly, it is clear that 
Equation: 


(x, &)(Yy,y) = (#, a) (ax, ax) 
= (ee loa (a,x) 


= |a|*(x, x)”. 


Thus, it is proven that |(x, y)|” = (x, x) (y, y) if = ay for somea € F. 


Next, consider the case in which y ~ az for all a € F’, which implies that y 4 0 
so (y, y) # 0. Thus, it follows by the properties of inner products that, for all 

a € F, (x — ay, x — ay) > 0. This can be expanded using the properties of inner 
products to the expression 

Equation: 


(t—ay,x—ay) = (x,@ — ay) — aly, — ay) 
= (a,x) —a(zx,y) —a(y, x) + |al” (y,y) 


_ (x,y) 
Choosing a = Ga 
Equation: 
Y, & v,Y L,Y)\Y, & 
(x —ay,z—ay) = (x, x) — (a, 9) - (ya) + eis : ty, 9) 
(yy) (y, y) (y, y) 
Si pedipn (x,y) (y, &) 
(ys ¥) 
Hence, it follows that (x, x) — sealtye) > 0. Consequently, 


(x, x) (y,y) — (x,y) (x,y) > 0. Thus, it can be concluded that 
(x,y) |? < (x, x) (y, y) ify 4 az foralla € F. 


Therefore, the inequality 
Equation: 


I(x, y) |" < (x, x) (y, y) 


holds for all z, y € V, and equality 
Equation: 


ayy) = (ao, 2) (gy, 9) 


holds if and only if y = az for somea € F. 


Additional Mathematical Implications 


Consider the maximization of ( 721 : th) where the norm || - || = (-, -) is 


induced by the inner product. By the Cauchy-Schwarz inequality, we know that 


2 2 
Car ir) < 1 and that Car tr) = 1 if and only if ty ; = 


for 


Te il 
x y 
Tel? Tel attains a maximum where Tw i= ar: ] 


some a € C. Thus, collecting the scalar variables, (ser tr) attains a 


some a € C. Hence, 


maximum where y = az. This result will be particulaly useful in developing the 
matched filter detector techniques. 


Matched Filter Detector 


Background Concepts 


A great many applications in signal processing, image processing, and beyond 
involve determining the presence and location of a target signal within some other 
signal. A radar system, for example, searches for copies of a transmitted radar 
pulse in order to determine the presence of and distance to reflective objects such 
as building or aircraft. A communication system searches for copies of 
waveforms representing digital Os and 1s in order to receive a message. 


As has already been shown, the expression Car th) attains its upper bound, 


which is 1, when y = az for some scalar a in a real or complex field. The lower 
bound, which is 0, is attained when z and y are orthogonal. In informal intuition, 
this means that the expression is maximized when the vectors x and y have the 
same shape or pattern and minimized when z and y are very different. A pair of 


vectors with similar but unequal shapes or patterns will produce relatively large 
value of the expression less than 1, and a pair of vectors with very different but 
not orthogonal shapes or patterns will produce relatively small values of the 
expression greater than 0. Thus, the above expression carries with it a notion of 
the degree to which two signals are “alike”, the magnitude of the normalized 
correlation between the signals in the case of the standard inner products. 


This concept can be extremely useful. For instance consider a situation in which 
we wish to determine which signal, if any, from a set X of signals most resembles 
a particular signal y. In order to accomplish this, we might evaluate the above 
expression for every signal x € X, choosing the one that results in maxima 
provided that those maxima are above some threshold of “likeness”. This is the 
idea behind the matched filter detector, which compares a set of signals against a 
target signal using the above expression in order to determine which among them 
are most like the target signal. For a detailed treatment of the applications of the 
matched filter detector see the liked module. 


Signal Comparison 


The simplest variant of the matched filter detector scheme would be to find the 
member signal in a set X of signals that most closely matches a target signal y. 
Thus, for every x € X we wish to evaluate 


Equation: 
x y 
m (x,y) = (= a) 
Nell” |lyl 


in order to compare every member of X to the target signal y. Since the member 
of X which most closely matches the target signal y is desired, ultimately we 
wish to evaluate 

Equation: 


x y 
Br SBN GEX \ Teel lull / 


Note that the target signal does not technically need to be normalized to produce a 
maximum, but gives the desirable property that m(z, y) is bounded to (0, 1]. 


The element x, € X that produces the maximum value of m(z, y) is not 
necessarily unique, so there may be more than one matching signal in X. 
Additionally, the signal z,, € X producing the maximum value of m(a, y) may 
not produce a very large value of m(z, y) and thus not be very much like the 
target signal y. Hence, another matched filter scheme might identify the argument 
producing the maximum but only above a certain threshold, returning no 
matching signals in X if the maximum is below the threshold. There also may be 
a signal x € X that produces a large value of m(z, y) and thus has a high degree 
of “likeness” to y but does not produce the maximum value of m(z, y). Thus, yet 
another matched filter scheme might identify all signals in _X producing local 
maxima that are above a certain threshold. 


Example: 

For example, consider the target signal given in [link] and the set of two signals 
given in [link]. By inspection, it is clear that the signal gz is most like the target 
signal f. However, to make that conclusion mathematically, we use the matched 
filter detector with the LD inner product. If we were to actually make the 
necessary computations, we would first normalize each signal and then compute 
the necessary inner products in order to compare the signals in _X with the target 
signal f. We would notice that the absolute value of the inner product for gz with 
f when normalized is greater than the absolute value of the inner product of g; 
with f when normalized, mathematically stated as 

Equation: 


a f 
92 —alQMax ze sg, go} fiz? {Il . 


Template Signal 
f(t) 


We wish to find a 
match for this 
target signal in the 


set of signals 
below. 


Candidate Signals 
gi (t) g z(t) 


We wish to find a match for the above 
target signal in this set of signals. 


Pattern Detection 


A somewhat more involved matched filter detector scheme would involve 
attempting to match a target time limited signal y = f to a set of of time shifted 
and windowed versions of a single signal X = {wS;g|t € IR} indexed by R. The 
windowing funtion is given by w (t) = u(t — t,) — u(t — ty) where [t,, to] is 
the interval to which f is time limited. This scheme could be used to find portions 
of g that have the same shape as f. If the absolute value of the inner product of 
the normalized versions of f and w/S¢g is large, which is the absolute value of the 
normalized correlation for standard inner products, then g has a high degree of 
“likeness” to f on the interval to which f is time limited but left shifted by t. Of 
course, if f is not time limited, it means that the entire signal has a high degree of 
“likeness” to f left shifted by t. 


Thus, in order to determine the most likely locations of a signal with the same 
shape as the target signal f in a signal g we wish to compute 
Equation: 


rere a ea 
FIl [}wSeg | 


to provide the desired shift. Assuming the inner product space examined is L(R 
(similar results hold for L (R [a, b)), lz (Z), and lz (Z [a, b))), this produces 
Equation: 


1 CO 
tm =argmax;cp TfliwSal i f (r)w (rT) g(t — t)dr . 


Since f and w are time limited to the same interval 
Equation: 


1 t 
Vilese tg 2 OF 


tm =argMmax;cr 


Making the subsitution h (t) = g(—t), 
Equation: 


1 to 
TAlwaa f, DREW Der 


tm =argmax,p 


Noting that this expression contains a convolution operation 
Equation: 


(f*A)(#) 
Fl llwSeg || 


tm =argmax;-p 


where A is the conjugate of the time reversed version of g defined by 


h(t) = o(—t). 


In the special case in which the target signal f is not time limited, w has unit 
value on the entire real line. Thus, the norm can be evaluated as 
[|wStg|| = ||Szg]| = ||g|| = ||A||. Therefore, the function reduces to 


tm =argmax;cp oe where h(t) = g(—t). The function f +* g = 


known as the normalized cross-correlation of f and g. 


Hence, this matched filter scheme can be implemented as a convolution. 
Therefore, it may be expedient to implement it in the frequency domain. Similar 
results hold for the Lz (R [a, b)), lz (Z), and lz (Z [a, b]) spaces. It is especially 
useful to implement the /2 (Z [a, b]) cases in the frequency domain as the power 
of the Fast Fourier Transform algorithm can be leveraged to quickly perform the 
computations in a computer program. In the Dz (R [{a, b)) and I, (Z [a, b}) cases, 
care must be taken to zero pad the signal if wrap-around effects are not desired. 
Similar results also hold for spaces on higher dimensional intervals with the same 
inner products. 


Of course, there is not necessarily exactly one instance of a target signal in a 
given signal. There could be one instance, more than one instance, or no instance 
of a target signal. Therefore, it is often more practical to identify all shifts 
corresponding to local maxima that are above a certain threshold. 


Example: 

The signal in [link] contains an instance of the template signal seen in [link] 
beginning at time ¢ = s; as shown by the plot in [link]. Therefore, 
Equation: 


_f _ wig ) 
WAI I\wSe9 | 


§1 =—argmax,cpr ( 


Pattern Signal 
f(t) 


This function 
shows tha pattern 
we are looking for 
in the signal 
below, which 
occurs at time 
t=s ile 


Longer Signal 
g(t) 


This signal contains an instance of the 
above signal starting at time t = sj. 


Absolute Value of Output 
Output 


Si S2 


This signal shows a sketch of the absolute 
value of the matched filter output for the 
interval shown. Note that this was just an 
"eyeball approximation" sketch. Observe 
the pronounced peak at time t = sj. 


Cauchy-Schwarz Inequality Video Lecture 


Proof of the Cauchy-Schwarz Inequality 
[missing_resource: 
http://www. youtube.com/v/r2PogGD18_U&hl=en_US&fs=1&rel=0] 


Video lecture on the proof of the Cauchy-Schwarz inequality from Khan 
Academy. Only part of the theorem is proven. 


Cauchy-Schwarz Inequality Summary 


As can be seen, the Cauchy-Schwarz inequality is a property of inner product 
spaces over real or complex fields that is of particular importance to the study of 
signals. Specifically, the implication that the absolute value of an inner product is 
maximized over normal vectors when the two arguments are linearly dependent is 
key to the justification of the matched filter detector. Thus, it enables the use of 
matched filters for such pattern matching applications as image detection, 
communications demodulation, and radar signal analysis. 


The Q-function 


The @-function is a convenient way to express right-tail probabilities for 
normal (Gaussian) random variables. For x € R, Q(z) is defined as the 
probability that a standard normal random variable (zero mean, unit 
variance) exceeds x: 


1 ~~ _# 


Q is a mapping from R to {0, 1}. One may also define Q(—oo) = 1 and 
Q(oco) = 0. 


Note:Some authors define the Q-function in different ways. One 
alternative is to define Q(x) = F(x) — F(0). This definition is discussed 
at MathWorld. 


[missing resource: .png] 


Q(z) is 
represented by the 
shaded region. 


Since Q(z) is monotonically decreasing, it has a well-defined inverse 
Q': {0,1} 5R. 
[missing resource: .png] 


A plot of Q(z) 


If F(x) denotes the cumulative distribution function of a standard normal, 
then clearly Q(x) = 1 — F(a). For this reason Q is also called the 
complementary cumulative distribution function. The Q-function is 
useful because the tail probability cannot be evaluated symbolically, and so 
Q(x) offers a concise notation for this integral. It is similar to the gamma 
and beta functions in this respect. 


Arbitrary Mean and Variance 


The @-function is also useful for expressing right-tail probabilities of 
nonstandard normal variates. If 


X~ N (pu, o*) 
then 


P Som 
Oo 


~ W(0,1) 


To express Pr[X > 4] in terms of Q, where y € R, define 7 = —". Then 
Equation: 


| 


Pr[X > 4| Pr[|X > no + py] 


= Pr|# > n] 
Q(n) 
Q(7>) 


| 


Relation to Erf and Erfinv 


The erf function is defined as 


2 ee 
exf (x) = — | edd 
0 


Both erf and its inverse, erfinv, are built into many common mathematical 
software packages such as Mathematica and Matlab. Therefore, they can be 
used to numerically evaluate Q and Q~!. By a change of variables, we have 


Q(n)=5 1 — erf a 


and 


Q*(y) — V2erfinv (1 — 2y) 


Approximations 


One approximation that is sometimes useful for z away from zero is 


il S52 
Q(z) ~ ie" 


Introduction to Detection Theory 


Introduction 


The intent of detection theory is to provide rational (instead of arbitrary) 
techniques for determining which of several conceptions--models--of data 
generation and measurement is most "consistent" with a given set of data. 
In digital communication, the received signal must be processed to 
determine whether it represents a binary "0" or "1"; in radar or sonar, the 
presence or absence of a target must be determined from measurements of 
propagating fields; in seismic problems, the presence of oil deposits must 
be inferred from measurements of sound propagation in the earth. Using 
detection theory, we will derive signal processing algorithms which will 
give good answers to questions such as these when the information-bearing 
signals are corrupted by superfluous signals (noise). 


The detection theory's foundation rests on statistical hypothesis testing 
(Cramér, 1946, Chapter 35; Lehman, 1986; Poor, 1988, Chapter 2; van 
the associated probabilistic structures), a random vector r expressing the 
observed data, and a listing of the probabilistic models--the hypotheses-- 
which may have generated 7, we want a systematic, optimal method of 
determining which model corresponds to the data. In the simple case where 
only two models-- .Z and .4,--are possible, we ask, for each set of 
observations, what is the "best" method of deciding whether .Z or 4, was 
true? We have many ways of mathematically stating what "best" means: we 
shall initially choose the average cost of each decision as our criterion for 
correctness. This seemingly arbitrary choice of criterion will be shown later 
not to impose rigid constraints on the algorithms that solve the hypothesis 
testing problem. Over a variety of reasonable criteria, one central solution 
to evaluating which model describes observations--the likelihood ratio test-- 
will persistently emerge; this result will form the basis of all detection 
algorithms. 


Detection problems become more elaborate and complicated when models 
become vague. Models are characterized by probability distributions, and 
these distributions suffice in the likelihood ratio test. Vagueness does not 


refer to this stochastic framework; rather, it refers to uncertainties in the 
probability distribution itself. The distribution may depend on unknown 
parameters, like noise power level. The distribution most certainly depends 
on signal structure; suppose that is partially or completely unknown? The 
most difficult (and interesting) problems emerge when uncertainties arise in 
the probability distributions themselves. For example, suppose the only 
model information we have is through data; how would an optimal detector 
be derived then? 


Along the way we will discover that a general geometric picture of 
detection emerges: Ease of a detection problem depends on how "far apart" 
the models are from each other. This geometric framework turns out to be 
elaborate, but underlies modern detection theory and forms links to 
information theory. 


Discrete-Time Detection Theory 


Introduction 


Detection theory applies optimal model evaluation to signals (Helstrom, 
Poor, van Trees). Usually, we measure a signal in the presence of additive 
noise over some finite number of samples. Each observed datum is of the 
form s(1) + (1) , where s(1) denotes the J‘" signal value and n(J) the J“ 
noise value. In this and in succeeding sections of this chapter, we focus the 
general methods of evaluating models. 


Detection of Signals in Gaussian Noise 


For the moment, we assume we know the joint distribution of the noise 
values. In most cases, the various models for the form of the observations - 
the hypothesis - do not differ because of noise characteristics. Rather, the 
signal component determines model variations and the noise is statistically 
independent of the signal; such is the specificity of detection problems in 
contrast to the generality of model evaluation. For example, we may want 
to determine whether a signal characteristic of a particular ship is present in 
a sonar array's output (the signal is known) or whether no ship is present 
(zero-valued signal). 


To apply optimal hypothesis testing procedures previously derived, we first 
obtain a finite number L of observations r(l), 1 € {0,..., 2 — 1}. These 
observations are usually obtained from continuous-time observations in one 
of two ways. Two commonly used methods for passing from continuous- 
time to discrete-time are known: integrate-and-dump and sampling. 
These techniques are illustrated in [link]. 


The two most common methods of converting continuous-time 
observations into discrete-time ones are shown. In the upper panel, the 
integrate-and-dump method is shown: the input is integrated over an 


interval of duration A and the result sampled. In the lower panel, the 
sampling method merely samples the input every A seconds. 


Integrate-and-Dump 


Sampling ys rm 
Filter 


mA 


Sampling 


Integrate-and-Dump 


In this procedure, no attention is paid to the bandwidth of the noise in 
selecting the sampling rate. Instead, the sampling interval A is selected 
according to the characteristics of the signal set. Because of the finite 
duration of the integrator, successive samples are statistically independent 
when the noise bandwidth exceeds = Consequently, the sampling rate can 


be varied to some extent while retaining this desirable analytic property. 


Sampling 


Traditional engineering considerations governed the selection of the 
sampling filter and the sampling rate. As in the integrate-and-dump 
procedure, the sampling rate is chosen according to signal properties. 
Presumably, changes in sampling rate would force changes in the filter. As 
we shall see, this linkage has dramatic implications on performance. 


With either method, the continuous-time detection problem of selecting 
between models (a binary selection is used here for simplicity) 


My: r(t)= s(t) +nit) 0<t<T 
M,:r(th=s'(t)+n(t) 0<t<T 


where {s‘(t)} denotes the known signal set and n(t) denotes additive 
noise modeled as a stationary stochastic process [footnote] is converted into 
the discrete-time detection problem 


My: ri = stn Oa1< £ 
M27, = 8, +n O-<t< LL 


where the sampling interval is always taken to divide the observation 
interval 7’: L = We form the discrete-time observations into a vector: 


r = (r(0)...r(L — 1))*. The binary detection problem is to distinguish 
between two possible signals present in the noisy output waveform. 


My:r=S)+n 
My:r=s,in 


To apply our model evaluation results, we need the probability density of r 
under each model. As the only probabilistic component of the observations 
is the noise, the required density for the detection problem is given by 


Pr\.a, (7) =Pn (7 — 81) 


and the corresponding likelihood ratio by 


A(r) — Pn (r — $1) 
Pn (r aS 0) 
Much of detection theory revolves about interpreting this likelihood ratio 


and deriving the detection threshold (either threshold or 7¥). 
We are not assuming the amplitude distribution of the noise to be Gaussian. 


Hypothesis Testing 


Suppose you measure a collection of scalars x;,..., 2. You believe the 
data is distributed in one of two ways. Your first model, call it Ho, 
postulates the data to be governed by the density f(x) (some fixed 
density). Your second model, Hj, postulates a different density f1(z). 
These models, termed hypotheses, are denoted as follows: 


Hy itn Jole) n= 1LacN 
Hitt fie) n= le 


A hypothesis test is a rule that, given a measurement z, makes a decision 
as to which hypothesis best "explains" the data. 


Example: 

Suppose you are confident that your data is normally distributed with 
variance 1, but you are uncertain about the sign of the mean. You might 
postulate 


Ag ty, 7 (1, 1) 
ie ey (a) 


These densities are depicted in [link]. 


Assuming each hypothesis is a priori equally likely, an intuitively 
appealing hypothesis test is to compute the sample mean x = + SS a Ba 


, and choose Ho if x < 0, and H; if x > 0. As we will see later, this test is 
in fact optimal under certain assumptions. 


Generalizations and Nomenclature 


The concepts introduced above can be extended in several ways. In what 
follows we provide more rigorous definitions, describe different kinds of 
hypothesis testing, and introduce terminology. 


Data 


In the most general setup, the observation is a collection 21,..., xy of 
random vectors. A common assumption, which facilitates analysis, is that 
the data are independent and identically distributed (IID). The random 
vectors may be continuous, discrete, or in some cases mixed. It is generally 
assumed that all of the data is available at once, although for some 
applications, such as Sequential Hypothesis Testing, the data is a never 
ending stream. 


Binary Versus M-ary Tests 


When there are two competing hypotheses, we refer to a binary hypothesis 
test. When the number of hypotheses is M > 2, we refer to an M-ary 
hypothesis test. Clearly, binary is a special case of /-ary, but binary tests 
are accorded a special status for certain reasons. These include their 
simplicity, their prevalence in applications, and theoretical results that do 
not carry over to the M-ary case. 


Example: 
Phase-Shift Keying 


Suppose we wish to transmit a binary string of length r over 
a noisy communication channel. We assign each of the 

M = 2" possible bit sequences to a signal s*, 

k = {1,...,M} where 


2m(k— I 
s* = cos (2nfon ale ae) 


This symboling scheme is known as phase-shift keying 
(PSK). After transmitting a signal across the noisy channel, 
the receiver faces an M-ary hypothesis testing problem: 


H):x2=s'+w 


Hy :2=s@+w 


where w ~ N (0, ca 


In many binary hypothesis tests, one hypothesis represents the absence of a 
ceratin feature. In such cases, the hypothesis is usually labelled Ho and 
called the null hypothesis. The other hypothesis is labelled Hy and called 
the alternative hypothesis. 


Example: 
Waveform Detection 


Consider the problem of detecting a known signal 
s = (s1...8y)’ in additive white Gaussian noise (AWGN). 
This scenario is common in sonar and radar systems. 
Denoting the data as @ = (x1...ay)", our hypothesis 
testing problem is 
dice ont} 

A, :x2=s+w 
where w ~ WV (0, od Hg is the null hypothesis, 
corresponding to the absence of a signal. 


Tests and Decision Regions 


Consider the general hypothesis testing problem where we have WN d- 
dimensional observations 21,...,£j and M hypotheses. If the data are 
real-valued, for example, then a hypothesis test is a mapping 


y: (R*)” > {1,...,M} 


For every possible realization of the input, the test outputs a hypothesis. 
The test y partitions the input space into a disjoint collection R,,..., Ry, 
where 


Re = A Giges Qn) Ol Cisse stn SF} 


The sets Rx are called decision regions. The boundary between two 
decision regions is a decision boundary. [link] depicts these concepts when 
d=2N = 1, ond = 3. 


Simple Versus Composite Hypotheses 


If the distribution of the data under a certain hypothesis is fully known, we 
call it a simple hypothesis. All of the hypotheses in the examples above are 
simple. In many cases, however, we only know the distribution up to certain 
unknown parameters. For example, in a Gaussian noise model we may not 
know the variance of the noise. In this case, a hypothesis is said to be 
composite. 


Example: 
Consider the problem of detecting the signal 


Si = COS ota ao) Vl lee 
where & is an unknown delay parameter. Then 
el Gye 
A,:x2=s+w 


is a binary test of a simple hypothesis (Hg) versus a composite alternative. 
Here we are assuming w, ~ WV (0, an, with ao? known. 


Often a test involving a composite hypothesis has the form 
A QO: d= Ao 
H,:04% 


where 6p is fixed. Such problems are called two-sided because the 
composite alternative "lies on both sides of Hg." When @ is a scalar, the test 


Ho :9< 4% 
H,:0> 4 


is called one-sided. Here, both hypotheses are composite. 


Example: 

Suppose a coin turns up heads with probability p. We want to assess 
whether the coin is fair (p = 5). We toss the coin NV times and record 
L1,--+-;LN (Lp = 1 means heads and z,, = O means tails). Then 


is a binary test of a simple hypothesis (Hg) versus a composite alternative. 
This is also a two-sided test. 


Errors and Probabilities 


In binary hypothesis testing, assuming at least one of the two models does 
indeed correspond to reality, there are four possible scenarios: 


e Case 1 Hp is true, and we declare Ho to be true 
e Case 2 Hy is true, but we declare H; to be true 
e Case 3 H; is true, and we declare H to be true 
e Case 4 H; is true, but we declare Ho to be true 


In cases 2 and 4, errors occur. The names given to these errors depend on 
the area of application. In statistics, they are called type I and type II 
errors respectively, while in signal processing they are known as a false 
alarm or a miss. 


Consider the general binary hypothesis testing problem 
Ho :@%~ fo(a),0 E Oo 
Ai: a2wrw fo(a), 0 EO; 


If Ho is simple, that is, 09 = {6}, then the size (denoted a), also called 
the false-alarm probability (Pr), is defined to be 


a = Pr = Pr[6o, declareH}| 
When @, is composite, we define 
a = Pr = supgce, (Pr/0, declareH;]) 


For 8 € @,, the power (denoted {), or detection probability (Pp), is 
defined to be 


B = Pp = Pr{6, declareH,| 
The probability of a type II error, also called the miss probability, is 
Py=1-—-Pp 


If H; is composite, then 8 = 6(6) is viewed as a function of 6. 


Criteria in Hypothesis Testing 


The design of a hypothesis test/detector often involves constructing the 
solution to an optimization problem. The optimality criteria used fall into 
two classes: Bayesian and frequent. 


Representing the former approach is the Bayes Risk Criterion. Representing 
the latter is the Neyman-Pearson Criterion. These two approaches are 
developed at length in separate modules. 


Statistics Versus Engineering Lingo 


The following table, adapted from Kay, _p.65, summarizes the different 
terminology for hypothesis testing from statistics and signal processing: 


Statistics Signal Processing 


Hypothesis Test Detector 

Null Hypothesis Noise Only Hypothesis 
Alternate Hypothesis Signal + Noise Hypothesis 
Critical Region Signal Present Decision Region 
Type I Error False Alarm 

‘Type II Error Miss 

Size of Test (a) Probability of False Alarm (Pr) 


Power of Test () Probability of Detection (Pp) 


The Neyman-Pearson Criterion 


In hypothesis testing, as in all other areas of statistical inference, there are two 
major schools of thought on designing good tests: Bayesian and frequentist (or 
classical). Consider the simple binary hypothesis testing problem 


KH: 2 ~ fo(x) 
FG: a~ fi(x) 


In the Bayesian setup, the prior probability 7; = Pr|.#] of each hypothesis 
occurring is assumed known. This approach to hypothesis testing is 
represented by the minimum Bayes risk criterion and the minimum probability 
of error criterion. 


In some applications, however, it may not be reasonable to assign an a priori 
probability to a hypothesis. For example, what is the a priori probability of a 
supernova occurring in any particular region of the sky? What is the prior 
probability of being attacked by a ballistic missile? In such cases we need a 
decision rule that does not depend on making assumptions about the a priori 
probability of each hypothesis. Here the Neyman-Pearson criterion offers an 
alternative to the Bayesian framework. 


The Neyman-Pearson criterion is stated in terms of certain probabilities 
associated with a particular hypothesis test. The relevant quantities are 
summarized in [link]. Depending on the setting, different terminology is used. 


Statistics Signal Processing 

Probability Name Notation Name Notation 
false- 

Po(declareH ) size a alarm Pr 


probability 


Statistics Signal Processing 


Probability Name Notation Name Notation 
detection 
P, (declare ) power B arobability Pp 


Here P;(declare.#%;) dentoes the probability that we declare hypothesis #4 to 
be in effect when #% is actually in effect. The probabilities Po (declare) 
and P; (declare) (sometimes called the miss probability), are equal to 

1 — Pr and 1 — Pp, respectively. Thus, P- and Pp represent the two degrees 
of freedom in a binary hypothesis test. Note that Pp and Pp do not involve a 
priori probabilities of the hypotheses. 


These two probabilities are related to each other through the decision regions. 
If R, is the decision region for #4, we have 


Pe= f folw)de 


Pp =f Aw dz 


The densities f;(a) are nonnegative, so as R, shrinks, both probabilities tend 
to zero. As R; expands, both tend to one. The ideal case, where Pp = 1 and 
Py = 0, cannot occur unless the distributions do not overlap (i.e., 

f fo(x) fi(a) d x = 0). Therefore, in order to increase Pp, we must also 
increase Pp. This represents the fundamental tradeoff in hypothesis testing 
and detection theory. 


Example: 
Consider the simple binary hypothesis test of a scalar measurement x: 


Foe ey (0,1) 
oi ee (lee) 


Suppose we use a threshold test 


where 7 € R is a free parameter. Then the false alarm and detection 


probabilities are 
CO 
1 
Pe= [ 
y V2T 


sae | (20E 
Pp= | e °F? di=Q(y-1) 
y V/27 


where @ denotes the Q-function. These quantities are depicted in [link]. 


eo? dt =Q(1) 


False alarm and detection values for a certain threshold. 


Since the Q-function is monotonicaly decreasing, it is evident that both Pp 
and Pr decay to zero as ¥y increases. There is also an explicit relationship 


Pp = Q(Q" (Pr) — 1) 


A common means of displaying this relationship is with a receiver operating 
characteristic (ROC) curve, which is nothing more than a plot of Pp versus 
Pr ((ink)). 


ROC curve for this example. 


The Neyman-Pearson Lemma: A First Look 


The Neyman-Pearson criterion says that we should construct our decision rule 
to have maximum probability of detection while not allowing the probability 
of false alarm to exceed a certain value a. In other words, the optimal detector 


according to the Neyman-Pearson criterion is the solution to the following 


constrainted optimization problem: 


Neyman-Pearson Criterion 


Equation: 


max {Pp},such thatPr < a 


The maximization is over all decision rules (equivalently, over all decision 
regions Ro, R,). Using different terminology, the Neyman-Pearson criterion 
selects the most powerful test of size (not exceeding) a. 


Fortunately, the above optimization problem has an explicit solution. This is 
given by the celebrated Neyman-Pearson lemma, which we now state. To 
ease the exposition, our initial statement of this result only applies to 
continuous random variables, and places a technical condition on the densities. 
A more general statement is given later in the module. 

Neyman-Pearson Lemma: initial statement 


Consider the test 


fi(z) 
fo(x) 
satisfies the condition that for each y € R, A(a) takes on the value y with 
probability zero under hypothesis .. The solution to the optimization 
problem in [link] is given by 


where f;(a) is a density. Define A(a) = , and assume that A(a:) 


where 77 is such that 
Pr= [ foe) de=a 


If ~ = 0, then 7 = oo. The optimal test is unique up to a set of probability 
zero under #% and “%. 

The optimal decision rule is called the likelihood ratio test. A(x) is the 
likelihood ratio, and 7 is a threshold. Observe that neither the likelihood 


ratio nor the threshold depends on the a priori probabilities Pr[.#]. they 
depend only on the conditional densities f; and the size constraint a. The 
threshold can often be solved for as a function of a, as the next example 
shows. 


Example: 
Continuing with [link], suppose we wish to design a Neyman-Pearson 
decision rule with size constraint a. We have 


Equation: 
eee eat 
Vir 
A(z) = = 
a 
— pio 


By taking the natural logarithm of both sides of the LRT and rarranging 
terms, the decision rule is not changed, and we obtain 


FE, 1 
z2In(n)+ 5 = 
KH 


Thus, the optimal rule is in fact a thresholding rule like we considered in 
[link]. The false-alarm probability was seen to be 


Pr = Q(y) 


Thus, we may express the value of -y required by the Neyman-Pearson lemma 
in terms of a: 


y= Q" (a) 


Sufficient Statistics and Monotonic Transformations 


For hypothesis testing involving multiple or vector-valued data, direct 
evaluation of the size (Pr) and power (Pp) of a Neyman-Pearson decision 
rule would require integration over multi-dimensional, and potentially 
complicated decision regions. In many cases, however, this can be avoided by 
simplifying the LRT to a test of the form 


where the test statistic £ = T(a:) is a sufficient statistic for the data. Such a 
simplified form is arrived at by modifying both sides of the LRT with 
montonically increasing transformations, and by algebraic simplifications. 
Since the modifications do not change the decision rule, we may calculate Pp 
and Pp in terms of the sufficient statistic. For example, the false-alarm 
probability may be written 

Equation: 


Pr = Prideclare#4| 
= f fo(t)dt 


where f(t) denotes the density of ¢ under @. Since t is typically of lower 
dimension than 2, evaluation of Pr and Pp can be greatly simplified. The key 
is being able to reduce the LRT to a threshold test involving a sufficient 
statistic for which we know the distribution. 


Example: 
Common Variances, Uncommon Means 


Let's design a Neyman-Pearson decision rule of size a for the 
problem 


KH, : a~ NV (0,071) 
UG eae ~ NV (pl, o71) 


where pz > 0, 0? > O are known, 0 = (0.. ‘Oe 


1 = (1...1)* are N-dimensional vectors, and I is the Nx N 
identity matrix. The likelihood ratio is 


Equation: 
fe , _ (en-1)? 
Qo2 
A a Tn Ino : 
(x) = 2 
N 1 Ss 
Tn V2n02 a 
(tn—")? 
= ene Ee 
ro 2 
e one oe 


N 
— ae = ast an p— pe 
N, N 
+5 (- £ +p eee zy) 


To simplify the test further we may apply the natural logarithm 
and rearrange terms to obtain 


Note: We have used the assumption p > 0. If ~ < 0, then 
division by p is not a monotonically increasing operation, and 
the inequalities would be reversed. 


The test statistic t is sufficient for the unknown mean. To set 
the threshold y, we write the false-alarm probability (size) as 


a= eae ol = / fo(t) dt 


To evaluate Pr, we need to know the density of ¢ under %. 
Fortunately, t is the sum of normal variates, so it is again 


normally distributed. In particular, we have t = Aa, where 
A=1' so 


t ~ W(AO, A (o*I)A*) = V(0, No”) 


under .#%. Therefore, we may write Pr in terms of the Q- 
function as 


ro He) 


The threshold is thus determined by 
y= VNoQ" (a) 
Under #4, we have 
t~ W(Al1, A (o7I)A*) = WY (Ny, No’) 


and so the detection probability (power) is 


Writing Pp as a function of Pr, the ROC curve is given by 


2p = o(orrn = 2M) 


The quantity iE is called the signal-to-noise ratio. As its 
name suggests, a larger SNR corresponds to improved 


performance of the Neyman-Pearson decision rule. 


Note:In the context of signal processing, the foregoing 
problem may be viewed as the problem of detecting a 
constant (DC) signal in additive white Gaussian noise: 


Go) a SN, 
(ses peated as 11g etal Deen A 


where A is a known, fixed amplitude, and w, ~ (0,07). 
Here A corresponds to the mean yp in the example. 


The Neyman-Pearson Lemma: General Case 


In our initial statement of the Neyman-Pearson Lemma, we assumed that for 
all n, the set {a, w| A(a) = 7} had probability zero under #. This 
eliminated many important problems from consideration, including tests of 
discrete data. In this section we remove this restriction. 


It is helpful to introduce a more general way of writing decision rules. Let y 
be a function of the data # with y(a) € [0, 1]. y defines the decision rule 
"declare #4, with probability y(a)." In other words, upon observing az, we 
flip a "p(a) coin." If it turns up heads, we declare #4; otherwise we declare 
4. Thus far, we have only considered rules with y(a) € {0, 1} 
Neyman-Pearson Lemma 


Consider the hypothesis testing problem 
KH: 2 ~ fo(a) 
A: a ~ fi(x) 


where fg and f; are both pdfs or both pmfs. Let a € [0, 1) be the size (false- 
alarm probability) constraint. The decision rule 


is the most powerful test of size a, where 7 and p are uniquely determined by 
requiring Pp = a. If a = 0, we take 7 = oo, p = 0. This test is unique up to 
sets of probability zero under #% and #%. 

When Pr|A(a) = 7] > 0 for certain 7, we choose 7 and p as follows: Write 


Pp = Pr[A(x) > yn] + p Pr[ A(x) = 7] 


Choose 77 such that 


Then choose p such that 


p Pr|A(x) = 7] = a — Pr[A(@) < 7] 


Example: 
Repetition Code 


Suppose we have a friend who is trying to transmit a bit (0 or 
1) to us over a noisy channel. The channel causes an error in 
the transmission (that is, the bit is flipped) with probability p, 
where 0 < p < 4, and pis known. In order to increase the 
chance of a successful transmission, our friend sends the same 
bit NV times. Assume the NV transmissions are statistically 
independent. Under these assumptions, the bits you receive are 
Bernoulli random variables: x, ~ Bernoulli (0). We are faced 
with the following hypothesis test: 


KH : 0 = p(0 sent) 
JE, : 0 = 1 -— p(1 sent) 


We decide to decode the received sequence x = (#1...2 ny) 
by designing a Neyman-Pearson rule. The likelihood ratio is 
Equation: 


Tin. =p)*"pl-*" 
Ae2) TIN, p?»(1—p)*" 
(1—p)*p¥-* 


where k = )~"_, x, is the number of 1s received. 


Note:k is a sufficient statistic for 0. 


The LRT is 


Taking the natural logarithm of both sides and rearranging, we 
have 

SN 1 Inf) 
Se aye 


k 


The false alarm probability is 
Equation: 


Pr = Prlk> | + pPrlk= | 
a an (7 Joka =p o( \ora =H! 


y and p are chosen so that Pr = a, as described above. 


The corresponding detection probability is 


Equation: 


Ey = lee a Salers =) 


Ebest (:) Oe C (1—p)"p" 


Problems 


Exercise: 


Problem: 


Design a hypothesis testing problem involving continous random 
variables such that Pr[A(x) = 7] > 0 for certain values of 7. Write down 
the false-alarm probability as a function of the threshold. Make as general 
a statement as possible about when the technical condition is satisfied. 


Exercise: 
Consider the scalar hypothesis testing problem 
KH: 2 ~ fo(x) 
KA 2 ~ fi(z) 
where 
Problem: f(x) = a a =40, 1} 


n (1+ (2-7) 


Write down the likelihood ratio test. 


Determine the decision regions as a function of 7, for all 7 > 0. Draw a 
representative of each. What are the "critical" values of 7? 


Note:There are five distinct cases. 


Compute the size and power (Pr and Pp) in terms of the threshold 7); 
and plot the ROC. 


Note: 


1 
———_—_ (7 — arctan 2 
eer @) 


Ral 
Suppose we decide to use a simple threshold test x 2 7 instead of the 
KH 


Neyman-Pearson rule. Does our performance # suffer much? Plot the 
ROC for this decision rule on the same graph as for the previous ROC. 


Exercise: 


Problem: 


Suppose we observe NV independent realizations of a Poisson random 
variable k with intensity parameter A: 
ee 

k! 


We must decide which of two intensities is in effect: 
KH A= Ao 
al A= AI 


where Ag < Aj. 


Write down the likelihood ratio test. 


Simplify the LRT to a test statistic involving only a sufficient statistic. 
Apply a monotonically increasing transformation to simplify further. 


Determine the distribution of the sufficient statistic under both 
hypotheses. 


Note: Use the characteristic function to show that a sum of IID Poisson 
variates is again Poisson distributed. 


Derive an expression for the probability of error. 


Assuming the two hypotheses are equally likely, and Ag = 5 and A; = 6, 
what is the minimum number WV of observations needed to attain a false- 
alarm probability no greater than 0.01? 


Note:If you have numerical trouble, try rewriting the log-factorial so as 
to avoid evaluating the factorial of large integers. 


Exercise: 


Problem: 


In [link], suppose p = 0.1. What is the smallest value of NV needed to 
ensure Pr < 0.01? What is Pp in this case? 


White Gaussian Noise 


By far the easiest detection problem to solve occurs when the noise vector consists 
of statistically independent, identically distributed, Gaussian random variables. In 
this book, a white sequence consists of statistically independent random variables. 
The white sequence's mean is usually taken to be zero [footnote] and each 
component's variance is 07. The equal-variance assumption implies the noise 
characteristics are unchanging throughout the entire set of observations. The 
probability density of the zero-mean noise vector evaluated at r — s; equals that of 
Gaussian random vector having independent components (.K = o7J) with mean s; 


L 
Pn (r — 3;) = ( ) e (are-)"r-29) 


2Qro2 


The resulting detection problem is similar to the Gaussian example examined so 
frequently in the hypothesis testing sections, with the distinction here being a non- 
zero mean under both models. The logarithm of the likelihood ratio becomes 


Ma 
(r — 8)” (r — 89) — (r — 81)* (r — 81) 2 207 n(n) 


Ma 
and the usual simplifications yield in 
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The quantities in parentheses express the signal processing operations for each 
model. If more than two signals were assumed possible, quantities such as these 
would need to be computed for each signal and the largest selected. This decision 
rule is optimum for the additive, white Gaussian noise problem. 

The zero-mean assumption is realistic for the detection problem. If the mean were 
non-zero, simply subtracting it from the observed sequence results in a zero-mean 
noise component. 


Each term in the computations for the optimum detector has a signal processing 
interpretation. When expanded, the term s;7 s; equals )~ a s;°(1), which is the 
signal energy E;. The remaining term - r7 s; - is the only one involving the 
observations and hence constitutes the sufficient statistic (7) for the additive 
white Gaussian noise detection problem. 


Yi(r) =r" 8; 


An abstract, but physically relevant, interpretation of this important quantity comes 
from the theory of linear vector spaces. There, the quantity r7 s; would be termed 
the dot product between r and s; or the projection of 7 onto s; . By employing 
the Schwarz inequality, the largest value of this quantity occurs when these vectors 
are proportional to each other. Thus, a dot product computation measures how 
much alike two vectors are: they are completely alike when they are parallel 
(proportional) and completely dissimilar when orthogonal (the dot product is zero). 
More precisely, the dot product removes those components from the observations 
which are orthogonal to the signal. The dot product thereby generalizes the familiar 
notion of filtering a signal contaminated by broadband noise. In filtering, the 
signal-to-noise ratio of a bandlimited signal can be drastically improved by lowpass 
filtering; the output would consist only of the signal and "in-band" noise. The dot 
product serves a similar role, ideally removing those "out-of-band" components 
(the orthogonal ones) and retaining the "in-band" ones (those parallel to the signal). 


Expanding the dot product, rs; = Sy"; a "'r(1)s;(1) another signal processing 
interpretation emerges. The dot product now describes a finite impulse response 
(FIR) filtering operation evaluated at a specific index. To demonstrate this 
interpretation, let h(J) be the unit-sample response of a linear, shift-invariant filter 
where h(l) = 0 for! < 0 and! > L. Letting r(J) be the filter's input sequence, the 
convolution sum expresses the output. 


k 


r(k)*h(k)= SY r(h(k—) 


l=k—(L—1) 


Letting k = L — 1, the index at which the unit-sample response's last value 
overlaps the input's value at the origin, we have 


| 
r(k)*h(k)|, 7-1 = => rh boL=l) 


[=0 


If we set the unit-sample response equal to the index-reversed, then delayed signal 
h(l) = s;(L — 1-1), we have 


r(k)*si(L—1—k)|,-1-1 = =S*r()s 


which equals the observation-dependent component of the optimal detector's 
sufficient statistic. [link] depicts these computations graphically. 
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The detector for signals contained in additive, white 
Gaussian noise consists of a matched filter, whose 
output is sampled at the duration of the signal and half 
of the signal energy is subtracted from it. The 
optimum detector incorporates a matched filter for 
each signal compares their outputs to determine the 
largest. 


The sufficient statistic for the i*" signal is thus expressed in signal processing 
notation as r(k)*s;(L —1—k)|,_, , — Si The filtering term is called a 
matched filter because the observations are passed through a filter whose unit- 
sample response "matches" that of the signal being sought. We sample the matched 
filter's output at the precise moment when all of the observations fall within the 
filter's memory and then adjust this value by half the signal energy. The adjusted 
values for the two assumed signals are subtracted and compared to a threshold. 


To compute the performance probabilities, the expressions should be simplified in 
the ways discussed in the hypothesis testing sections. As the energy terms are 
known a priori they can be incorporated into the threshold with the result 


I-1 
r(0) (si(2) ~ s0(0)) 2 2? Ia(n) + “1 "2 
1=0 yw 


The left term constitutes the sufficient statistic for the binary detection problem. 
Because the additive noise is presumed Gaussian, the sufficient statistic is a 
Gaussian random variable no matter which model is assumed. Under .G@; , the 
specifics of this probability distribution are 


L-1 
27 (810) — s0(2)) ~ W(y si(l) (1(2) — so(2)),07 S) (si() — +0? 
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The false-alarm probability is given by 


a? In(n) +z — d so(l) (81 (2) — s0(2)) 


o( 3 (s1(2) — s0(2))”) 


Pr=Q 
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The signal-related terms in the numerator of this expression can be manipulated 
with the false-alarm probability (and the detection probability) for the optimal 
white Gaussian noise detector succinctly expressed by 


Pr =Q In(n) + gor & (s1() ~ sol)" 


Note that the only signal-related quantity affecting this performance probability 
(and all of the others) is the ratio of energy in the difference signal to the noise 
variance. The larger this ratio, the better (smaller) the performance probabilities 
become. Note that the details of the signal waveforms do not greatly affect the 
energy of the difference signal. For example, consider the case where the two 
signal energies are equal (Hy) = EF, = E); the energy of the difference signal is 
given by 2E& — 25) so(l)si(l). The largest value of this energy occurs when the 
signals are negatives of each other, with the difference-signal energy equaling 4. 
Thus, equal-energy but opposite-signed signals such as sine waves, square-waves, 
Bessel functions, etc. all yield exactly the same performance levels. The essential 
signal properties that do yield good performance values are elucidated by an 
alternate interpretation. The term 5~ (s1(1) — so(L))* equals (|| $1 — 80 ||)”, the 
L? norm of the difference signal. Geometrically, the difference-signal energy is the 
same quantity as the square of the Euclidean distance between the two signals. In 


these terms, a larger distance between the two signals will mean better 
performance. 


Example: 

Detection, Gaussian example 

A common detection problem in array processing is to determine whether a signal 
is present (.@%,) or not (-@p ) in the array output. In this case, s9(J) = 0 The 
optimal detector relies on filtering the array output with a matched filter having an 
impulse response based on the assumed signal. Letting the signal under .4; be 
denoted simply by s(J), the optimal detector consists of 
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The false-alarm and detection probabilities are given by 


Pr=Q a 
Et 


Pp = a(2 1 = V2) 


[link] displays the probability of detection as a function of the signal-to-noise ratio 
5 for several values of false-alarm probability. Given an estimate of the expected 


signal-to-noise ratio, these curves can be used to assess the trade-off between the 
false-alarm and detection probabilities. 


Probability of Detection 


S/N (cB) 


The probability of detection is plotted 
versus signal-to-noise ratio for various 
values of the false-alarm probability Pp. 
False-alarm probabilities range from 107+ 
down to 10° by decades. The matched 
filter receiver was used since the noise is 
white and Gaussian. Note how the range of 
signal-to-noise ratios over which the 
detection probability changes shrinks as 
the false-alarm probability decreases. This 
effect is a consequence of the non-linear 
nature of the function Q(-). 


The important parameter determining detector performance derived in this example 
is the signal-to-noise ratio ae the larger it is, the smaller the false-alarm 


probability is (generally speaking). Signal-to-noise ratios can be measured in many 
different ways. For example, one measure might be the ratio of the rms signal 
amplitude to the rms noise amplitude. Note that the important one for the detection 
problem is much different. The signal portion is the sum of the squared signal 
values over the entire set of observed values - the signal energy; the noise portion 
is the variance of each noise component - the noise power. Thus, energy can be 
increased in two ways that increase the signal-to-noise ratio: the signal can be made 
larger or the observations can be extended to encompass a larger number of values. 


To illustrate this point, two signals having the same energy are shown in [link]. 
When these signals are shown in the presence of additive noise, the signal is visible 
on the left because its amplitude is larger; the one on the right is much more 
difficult to discern. The instantaneous signal-to-noise ratio-the ratio of signal 
amplitude to average noise amplitude - is the important visual cue. However, the 
kind of signal-to-noise ratio that determines detection performance belies the eye. 
The matched filter outputs have similar maximal values, indicating that total signal 
energy rather than amplitude determines the performance of a matched filter 
detector. 


Signal 


Signal + Noise 


Matched Filter Output 


-10 -10 


Two signals having the same energy are shown at the top 
of the figure. The one on the left equals one cycle of a 
sinusoid having ten samples/period ( sin(wol) with 
Wo = 270.1). On the right, ten cycles of similar signal is 
shown, with an amplitude a factor of V/10 smaller. The 
middle portion of the figure shows these signals with the 
same noise signal added; the duration of this signal is 200 
samples. The lower portion depicts the outputs of 
matched filters for each signal. The detection threshold 
was set by specifying a false-alarm probability of 10~°. 


Validity of the White Noise Assumption 


The optimal detection paradigm for the additive, white Gaussian noise problem has 
a relatively simple solution: construct FIR filters whose unit-sample responses are 
related to the presumed signals and compare the filtered outputs with a threshold. 
We may well wonder which assumptions made in this problem are most 
questionable in "real-world" applications. noise is additive in most cases. In many 
situation, the additive noise present in observed data is Gaussian. Because of the 
Central Limit Theorem, if numerous noise sources impinge on a measuring device, 
their superposition will be Gaussian to a great extent. As we know from the 
discussion on the Central Limit Theorem, glibly appealing to the Central Limit 
Theorem is not without hazards; the non-Gaussian detection problem will be 
discussed in some detail later. Interestingly, the weakest assumption is the 
"whiteness" of the noise. Note that the observation sequence is obtained as a result 
of sampling the sensor outputs. Assuming white noise samples does not mean that 
the continuous-time noise was white. White noise in continuous time has infinite 
variance and cannot be sampled; discrete-time white noise has a finite variance 
with a constant power spectrum. The Sampling Theorem suggests that a signal is 
represented accurately by its samples only if we choose a sampling frequency 
commensurate with the signal's bandwidth. One should note that fidelity of 
representation does not mean that the sample values are independent. In most 
cases, satisfying the Sampling Theorem means that the samples are correlated. As 
shown in Sampling and Random Sequences, the correlation function of sampled 
noise equals samples of the original correlation function. For the sampled noise to 
be white, E[n(1,T)n(l2T)| = 0 for 1, # 12: the samples of the correlation 
function at locations other than the origin must all be zero. While some correlation 
functions have this property, many examples satisfy the sampling theorem but 
do not yield uncorrelated samples. In many practical situations, undersampling 
the noise will reduce inter-sample correlation. Thus, we obtain uncorrelated 
samples either by deliberately undersampling, which wastes signal energy, or by 
imposing anti-aliasing filters that have a bandwidth larger than the signal and 
sampling at the signal's Nyquist rate. Since the noise power spectrum usually 
extends to higher frequencies than the signal, this intentional undersampling can 
result in larger noise variance. in either case, by trying to make the problem at hand 
match the solution, we are actually reducing performance! We need a direct 
approach to attacking the correlated noise issue that arises in virtually all sampled- 
data detection problems rather than trying to work around it. 


Elementary Hypothesis Testing 


In statistics, hypothesis testing is some times known as decision theory or simply testing. The key 
result around which all decision theory revolves is the likelihood ratio test. 


The Likelihood Ratio Test 


In a binary hypothesis testing problem, four possible outcomes can result. Model .@ did in fact 
represent the best model for the data and the decision rule said it was (a correct decision) or said it 
wasn't (an erroneous decision). The other two outcomes arise when model .@, was in fact true with 
either a correct or incorrect decision made. The decision process operates by segmenting the range 
of observation values into two disjoint decision regions S%9 and $8,. All values of r fall into either 
Ro or Ry. If a given r lies in Ro, for example, we will announce our decision ""model .4 was 
true"; if in R,, model .4, would be proclaimed. To derive a rational method of deciding which 
model best describes the observations, we need a criterion to assess the quality of the decision 


process. Optimizing this criterion will specify the decision regions. 


The Bayes’ decision criterion seeks to minimize a cost function associated with making a decision. 
Let Cj; be the cost of mistaking model 7 for model 7 (¢ 4 7) and C;; the presumably smaller cost of 
correctly choosing model 7: C;; > Ci, 1 A j. Let 7; be the a priori probability of model 7. The so- 


called Bayes' cost C is the average cost of making a decision. 
Equation: 
C= Vistiy Cismj Prlsay M when H; true| 
igs} C37; Prisay 4; | H; true] 


The Bayes' cost can be expressed as 

Equation: 

C= igi} Cit Pr[r CR; | Mo true] 

Listig Cus J Priam, (vr) dr 

J Coot Prj.m (7) + Com Prim (r)d r+ f Crom Prim (7) + Cum Prim (r) dr 
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Pr|_a, (7) is the conditional probability density function of the observed data r given that model 
M,, was true. To minimize this expression with respect to the decision regions 9 and $%1, ponder 
which integral would yield the smallest value if its integration domain included a specific 
observation vector. This selection process defines the decision regions; for example, we choose 4% 
for those values of r which yield a smaller value for the first integral. 


T C00 Pr|-M (7) + 71Co1 Prim (7) < ToC10 Prim (7) + 710 Prim (7) 


We choose .@ when the inequality is reversed. This expression is easily manipulated to obtain the 
decision rule known as the likelihood ratio test. 
Equation: 


The comparison relation means selecting model ./ if the left-hand ratio exceeds the value on the 
Pra (") 
Pp, |M& (r) 
A(r), is computed from the observed value of r and then compared with a threshold 7 equaling 
to(C19—Coo) 
m™1(Co1—C11) * 
expressed as the comparison of the likelihood ratio with a threshold. 
Equation: 


right; otherwise, Zp is selected. Thus, the likelihood ratio 


symbolically represented by 


Thus, when two models are hypothesized, the likelihood ratio test can be succinctly 


M, 
A(r) 2 n 
Me 


aaa . Furthermore, 
note that only the value of the likelihood ratio relative to the threshold matters; to simplify the 
computation of the likelihood ratio, we can perform any positively monotonic operations 
simultaneously on the likelihood ratio and the threshold without affecting the comparison. We can 
multiply the ratio by a positive constant, add any constant, or apply a monotonically increasing 
function which simplifies the expressions. We single one such function, the logarithm, because it 
simplifies likelihood ratios that commonly occur in signal processing applications. Known as the 
log-likelihood, we explicitly express the likelihood ratio test with it as 

Equation: 


The data processing operations are captured entirely by the likelihood ratio 


Useful simplifying transformations are problem-dependent; by laying bare that aspect of the 
observations essential to the model testing problem, we reveal the sufficient statistic (1): the 
scalar quantity which best summarizes the data (Lehmann, pp. 18-22). The likelihood ratio test is 
best expressed in terms of the sufficient statistic. 

Equation: 


My, 
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We will denote the threshold value by y when the sufficient statistic is used or by 7 when the 
likelihood ratio appears prior to its reduction to a sufficient statistic. 


As we shall see, if we use a different criterion other than the Bayes' criterion, the decision rule often 
involves the likelihood ratio. The likelihood ratio is comprised of the quantities p,_, (r), termed 
the likelihood function, which is also important in estimation theory. It is this conditional density 
that portrays the probabilistic model describing data generation. The likelihood function completely 


characterizes the kind of "world" assumed by each model; for each model, we must specify the 
likelihood function so that we can solve the hypothesis testing problem. 


A complication, which arises in some cases, is that the sufficient statistic may not be monotonic. If 
monotonic, the decision regions 9 and $4; are simply connected (all portions of a region can be 
reached without crossing into the other region). If not, the regions are not simply connected and 
decision region islands are created (see this problem). Such regions usually complicate calculations 
of decision performance. Monotonic or not, the decision rule proceeds as described: the sufficient 
statistic is computed for each observation vector and compared to a threshold. 


Example: 

An instructor in a course in detection theory wants to determine if a particular student studied for 
his last test. The observed quantity is the student's grade, which we denote by r. Failure may not 
indicate studiousness: conscientious students may fail the test. Define the models as 


¢ ™M: did not study 
e« M,: did study 


The conditional densities of the grade are shown in [link]. 


Conditional densities for the grade 
distributions assuming that a student did 
not study (.@) or did (4) are shown in 
the top row. The lower portion depicts the 
likelihood ratio formed from these 
densities. 


Based on knowledge of student behavior, the instructor assigns a priori probabilities of 7 = 1/4 
and 7, = 3/4. The costs C;; are chosen to reflect the instructor's sensitivity to student feelings: 
C1 = 1 = Co (an erroneous decision either way is given the same cost) and Cog = 0 = C1. The 
likelihood ratio is plotted in [link] and the threshold value 7, which is computed from the a priori 
probabilities and the costs to be 1/3, is indicated. The calculations of this comparison can be 
simplified in an obvious way. 


eis 
50 


or 


M, 
r > 50/3 = 16.7 
Me 


The multiplication by the factor of 50 is a simple illustration of the reduction of the likelihood ratio 
to a sufficient statistic. Based on the assigned costs and a priori probabilities, the optimum decision 
rule says the instructor must assume that the student did not study if the student's grade is less than 
16.7; if greater, the student is assumed to have studied despite receiving an abysmally low grade 
such as 20. Note that as the densities given by each model overlap entirely: the possibility of 
making the wrong interpretation always haunts the instructor. However, no other procedure will be 
better! 


The Bayes Risk Criterion in Hypothesis Testing 


The design of a hypothesis test/detector often involves constructing the solution 
to an optimization problem. The optimality criteria used fall into two classes: 
Bayesian and frequent. 


In the Bayesian setup, it is assumed that the a priori probability of each 
hypothesis occuring (7;) is known. A cost C}j is assigned to each possible 
outcome: 


Ci; = Pr|sayH;whenH ;true| 


The optimal test/detector is the one that minimizes the Bayes risk, which is 
defined to be the expected cost of an experiment: 
C= S- Ci; Prisay H;whenH true] 
1,J 


In the event that we have a binary problem, and both hypotheses are simple, the 
decision rule that minimizes the Bayes risk can be constructed explicitly. Let us 
assume that the data is continuous (i.e., it has a density) under each hypothesis: 


Ho 7:a~ fo(x) 
Ay ars baad fi(x) 


Let Ro and R, denote the decision regions corresponding to the optimal test. 
Clearly, the optimal test is specified once we specify Ro and R, = Ro’. 


The Bayes risk may be written 
Equation: 


C = yaar Cin [ ia) dz 
Jf Cootofo(#) + Cormfi(a) da + f Cromofo(x) + Cumfi(a) dx 


Recall that Ro and R; partition the input space: they are disjoint and their union 
is the full input space. Thus, every possible input z belongs to precisely one of 
these regions. In order to minimize the Bayes risk, a measurement x should 
belong to the decision region R; for which the corresponding integrand in the 


preceding equation is smaller. Therefore, the Bayes risk is minimized by 
assigning x to Ro whenever 


ToCoofo(x) + mCoifi(@) < moCiofo(#) + mCufi(x) 


and assigning x to R, whenever this inequality is reversed. The resulting rule 
may be expressed concisely as 


_ file) Bh 19 (Cio — Coo) _ 
~ f(z) ® m(Cu—Cu) 


A(a) 


Here, A(a) is called the likelihood ratio, 7) is called the threshold, and the 
overall decision rule is called the Likelihood Ratio Test (LRT). The expression 
on the right is called a threshold. 


Example: 

An instructor in a course in detection theory wants to determine if a particular 
student studied for his last test. The observed quantity is the student's grade, 
which we denote by 7. Failure may not indicate studiousness: conscientious 
students may fail the test. Define the models as 


e —_g: did not study 
e 1: did study 


The conditional densities of the grade are shown in [link]. 
Py | Hy | Ho} 


Conditional densities for the grade 
distributions assuming that a student did 
not study( 9)ordid( 1) areshown in 
the top row. The lower portion depicts the 
likelihood ratio formed from these 
densities. 


Based on knowledge of student behavior, the instructor assigns a priori 
probabilities of 79 = + and 77, = + The costs C;; are chosen to reflect the 
instructor's sensitivity to student feelings: Co, = 1 = C9 (an erroneous 
decision either way is given the same cost) and C99 = 0 = Cj,. The likelihood 
ratio is plotted in [link] and the threshold value 7, which is computed from the a 
priori probabilities and the costs to be + is indicated. The calculations of this 


comparison can be simplified in an obvious way. 


1 
50 - 3 
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r> ae = Hi 


0 


The multiplication by the factor of 50 is a simple illustration of the reduction of 
the likelihood ratio to a sufficient statistic. Based on the assigned costs and a 
priori probabilities, the optimum decision rule says the instructor must assume 
that the student did not study if the student's grade is less than 16.7; if greater, 
the student is assumed to have studied despite receiving an abysmally low grade 
such as 20. Note that as the densities given by each model overlap entirely: the 
possibility of making the wrong interpretation always haunts the instructor. 
However, no other procedure will be better! 


A special case of the minimum Bayes risk rule, the minimum probability of error 
rule, is used extensively in practice, and is discussed at length in another module. 


Problems 


Exercise: 


Problem: 


Denote a = Pr{declareH;whenHotrue] and 


6 = PrideclareH,whenH; true]. Express the Bayes risk C in terms of a 
and 8, C;;, and 7;. Argue that the optimal decision rule is not altered by 
setting Coo = C11 = 0. 


Exercise: 
Problem: 


Suppose we observe a such that A(a) = 7. Argue that it doesn't matter 
whether we assign zw to Ro or Ry. 


Sufficient Statistics 


Introduction 


Sufficient statistics arise in nearly every aspect of statistical inference. It is 
important to understand them before progressing to areas such as 
hypothesis testing and parameter estimation. 


Suppose we observe an V-dimensional random vector X, characterized by 
the density or mass function fg (a), where @ is a p-dimensional vector of 
parameters to be estimated. The functional form of f (a) is assumed known. 
The parameter 90 completely determines the distribution of X. Conversely, 
a measurement a of X provides information about @ through the 
probability law fg (2). 


Example: 


xX 
Suppose X = ey , where X; ~ V(0, 1) are IID. Here @ is a scalar 
2 


parameter specifying the mean. The distribution of X is determined by 0 
through the density 


1 ce eee __ (22-0)? 
fy (x) = ——e 
/ 24 V/ 20 


100 
On the other hand, if we observe x = ( 


rel , then we may safely assume 


0 = 0 is highly unlikely. 


The N-dimensional observation X carries information about the p- 
dimensional parameter vector @. If p < N, one may ask the following 
question: Can we compress 2 into a low-dimensional statistic without any 
loss of information? Does there exist some function £ = Ta), where the 


dimension of t is MZ < NN, such that t carries all the useful information 
about 0? 


If so, for the purpose of studying @ we could discard the raw measurements 
x and retain only the low-dimensional statistic £. We call ¢ a sufficient 
statistic. The following definition captures this notion precisely: 


Let Xj,..., X.¢ be arandom sample, governed by the density or 

probability mass function f (a |@). The statistic Ta) is sufficient for 
6 if the conditional distribution of x, given T(a) = t, is independent 
of @. Equivalently, the functional form of fg), (a) does not involve 0. 


How should we interpret this definition? Here are some possibilities: 


1. Let fg (a, €) denote the joint density or probability mass function on 
(X,7T(X)). If T(X) is a sufficient statistic for 8, then 
Equation: 


fo (a) = fo (x, T(x)) 


fo\z (x) fo (t) 
= f(a|t) fo (t) 


Therefore, the parametrization of the probability law for the measurement z 
is manifested in the parametrization of the probability law for the statistic 
Ta), 


2. Given t = T(a), full knowledge of the measurement 2 brings no 
additional information about @. Thus, we may discard a and retain on the 
compressed statistic f. 


3. Any inference strategy based on fg (a) may be replaced by a strategy 
based on fg (t). 


Example: 
Binary Information Source 


(Scharf, pp.78) Suppose a binary information source emits a 
sequence of binary (0 or 1) valued, independent variables 
£1,---,xN. Each binary symbol may be viewed as a 
realization of a Bernoulli trial: z,, ~ Bernoulli (6), iid. The 
parameter 6 € [0, 1] is to be estimated. 


The probability mass function for the random sample 
bs (ay. a ay)" is 


Equation: 
N N 
fg (x) = || fo (en) [[ *G-9)*" 
n=1 n=1 


where k = ee Xp, is the number of 1's in the sample. 
We will show that k is a sufficient statistic for x. This will 
entail showing that the conditional probability mass 
function fg), (a) does not depend on 8. 


The distribution of the number of ones in NV independent 
Bernoulli trials is binomial: 


fy (k) = (.) oy = 


Next, consider the joint distribution of (a, 5) 2,). We have 


Hy ()) (« Ys] 


Thus, the conditional probability may be written 
Equation: 


fol, (@) = ne 


This shows that k is indeed a sufficient statistic for 0. The 
N values 21,...,y can be replaced by the quantity k 
without losing information about 0. 


Exercise: 
Problem: 
In the previous example, suppose we wish to store in memory the 


information we possess about 8. Compare the savings, in terms of bits, 
we gain by storing the sufficient statistic k instead of the full sample 


L1,+++ UN. 
Determining Sufficient Statistics 


In the example above, we had to guess the sufficient statistic, and work out 
the conditional probability by hand. In general, this will be a tedious way to 


go about finding sufficient statistics. Fortunately, spotting sufficient 
statistics can be made easier by the Fisher-Neyman Factorization Theorem. 


Uses of Sufficient Statistics 


Sufficient statistics have many uses in statistical inference problems. In 
hypothesis testing, the Likelihood Ratio Test can often be reduced to a 
sufficient statistic of the data. In parameter estimation, the Minimum 
Variance Unbiased Estimator of a parameter @ can be characterized by 
sufficient statistics and the Rao-Blackwell Theorem. 


Minimality and Completeness 


Minimal sufficient statistics are, roughly speaking, sufficient statistics that 
cannot be compressed any more without losing information about the 
unknown parameter. Completeness is a technical characterization of 
sufficient statistics that allows one to prove minimality. These topics are 
covered in detail in this module. 


Further examples of sufficient statistics may be found in the module on the 
Fisher-Neyman Factorization Theorem. 


Detection in the Presence of Unknowns 


We assumed in the previous sections that we have a few well-specified 
models (hypotheses) for a set of observations. These models were 
probabilistic; to apply the techniques of statistical hypothesis testing, the 
models take the form of conditional probability densities. In many 
interesting circumstances, the exact nature of these densities may not be 
known. For example, we may know a priori that the mean is either zero or 
some constant (as in the Gaussian example). However, the variance of the 
observations may not be known or the value of the non-zero mean may be 
in doubt. In an array processing context, these respective situations could 
occur when the background noise level is unknown (a likely possibility in 
applications) or when the signal amplitude is not known because of far-field 
range uncertainties (the further the source of propagating energy, the 
smaller its received energy at each sensor). In an extreme case, we can 
question the exact nature of the probability densities (everything is not 
necessarily Gaussian!). The model evaluation problem can still be posed for 
these situations; we classify the "unknown" aspects of a model testing 
problem as either parametric (the variance is not known, for example) or 
nonparametric (the formula for the density is in doubt). The former 
situation has a relatively long history compared to the latter; many 
techniques can be used to approach parametric problems while the latter is a 
subject of current research (Gibson and Melsa). We concentrate on 
parametric problems here. 


We describe the dependence of the conditional density on a set of 
parameters by incorporating the parameter vector @ as part of the condition. 
We write the likelihood function as p,|_4 (7) for the parametric problem. 
In statistics, this situation is said to be a composite hypothesis (Cramer). 
Such situations can be further categorized according to whether the 
parameters are random or nonrandom. For a parameter to be random, we 
have an expression for its a priori density, which could depend on the 
particular model. As stated many times, a specification of a density usually 
expresses some knowledge about the range of values a parameter may 
assume and the relative probability of those values. Saying that a parameter 
has a uniform distribution implies that the values it assumes are equally 
likely, not that we have no idea what the values might be and express this 


ignorance by a uniform distribution. If we are ignorant of the underlying 
probability distribution that describes the values of a parameter, we will 
characterize them simply as being unknown (not random). Once we have 
considered the random parameter case, nonrandom but unknown parameters 
will be discussed. 


Random Parameters 


When we know the density of 0, the likelihood function can be expressed as 
Pra, (7) = / Pr\.ao (7) pe (8) d@ 


and the likelihood ratio in the random parameter case becomes 


_ J Primo (7) pe (0) de 
J Pr\.ao (7) pe (8) de 
Unfortunately, there are many examples where either the integrals involved 


are intractable or the sufficient statistic is virtually the same as the 
likelihood ratio, which can be difficult to compute. 


A(r) 


Example: 

A simple, but interesting, example that results in a computable answer 
occurs when the mean of Gaussian random variables is either zero (model 
0) or is +(m) with equal probability (hypothesis 1). The second 
hypothesis means that a non-zero mean is present, but its sign is not 
known. We are therefore stating that if hypothesis one is in fact valid, the 
mean has fixed sign for each observation - what is random is its a priori 
value. As before, L statistically independent observations are made. 
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The numerator of the likelihood ratio is the sum of two Gaussian densities 
weighted by 1/2 (the a priori probability values), one having a positive 
mean, the other negative. The likelihood ratio, after simple cancellation of 
common terms, becomes 


1 25 Fe rj —Lm2 if 2m) 0 rj—Lm 
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and the decision rule takes the form 
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where cosh(«) is the hyperbolic cosine given simply as 2—. As this 
quantity is an even function, the sign of its argument has no effect on the 
result. The decision rule can be written more simply as 
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The sufficient statistic equals the magnitude of the sum of the 
observations in this case. While the right side of this expression, which 
equals yy, is complicated, it need only be computed once. Calculation of the 
performance probabilities can be complicated; in this case, the false-alarm 
probability is easy to find and the others more difficult. 


Non-Random Parameters 


In those cases where a probability density for the parameters cannot be 
assigned, the model evaluation problem can be solved in several ways; the 
methods used depend on the form of the likelihood ratio and the way in which 
the parameter(s) enter the problem. In the Gaussian problem we have 
discussed so often, the threshold used in the likelihood ratio test 7 may be 
unity. In this case, examination of the resulting computations required reveals 
that implementing the test does not require knowledge of the variance of 
the observations (see this problem). Thus, if the common variance of the 
underlying Gaussian distributions is not known, this lack of knowledge has no 
effect on the optimum decision rule. This happy situation - knowledge of the 
value of a parameter is not required by the optimum decision rule - occurs 
rarely, but should be checked before using more complicated procedures. 


A second fortuitous situation occurs when the sufficient statistic as well as its 
probability density under one of the models do not depend on the unknown 
parameter(s). Although the sufficient statistic's threshold -y expressed in terms 
of the likelihood ratio's threshold 7 depends on the unknown parameters, y 
may be computed as a single value using the Neyman-Pearson criterion if the 
computation of the false-alarm probability does not involve the unknown 
parameters. 


Example: 

Continuing the example of the previous section, let's consider the situation 
where the value of the mean of each observation under model .Z; is not 
known. The sufficient statistic is the sum of the observations (that quantity 
doesn't depend on m) and the distribution of the observation vector under 
model .4, does not depend on m (allowing computation of the false-alarm 
probability). However, a subtlety emerges; in the derivation of the sufficient 
Statistic, we had to divide by the value of the mean. The critical step occurs 
once the logarithm of the likelihood ratio is manipulated to obtain 


Recall that only positively monotonic transformations can be applied; if a 
negatively monotonic operation is applied to this inequality (such as 
multiplying both sides by -1), the inequality reverses. If the sign of m is 
known, it can be taken into account explicitly and a sufficient statistic results. 
If, however, the sign is not known, the above expression cannot be 
manipulated further and the left side constitutes the sufficient statistic for this 
problem. The sufficient statistic then depends on the unknown parameter and 
we cannot develop a decision rule in this case. If the sign is known, we can 
proceed. Assuming the sign of m is positive, the sufficient statistic is the sum 
of the observations and the threshold ¥ is found by 


y = VLoQ"1(Pr) 


Note that if the variance o? instead of the mean were unknown, we could not 
compute the threshold. The difficulty lies not with the sufficient statistic (it 
doesn't depend on the variance), but with the false alarm probability as the 
expression indicates. Another approach is required to deal with the unknown- 
variance problem. 


When this situation occurs - the sufficient statistic and the false-alarm 
probability can be computed without needing the parameter in question, we 
have established what is known as a uniformly most powerful test (or UMP 
which can only be demonstrated by explicitly finding the sufficient statistic 
and evaluating its probability distribution, then the composite hypothesis 
testing problem cannot be solved without some value for the parameter being 
used. 


This seemingly impossible situation - we need the value of the parameter that 
is assumed unknown - can be approached by noting that some data is available 
for "guessing" the value of the parameter. If a reasonable guess could be 
obtained, it could then be used in our model evaluation procedures developed 
in this chapter. The data available for estimating unknown parameters are 
precisely the data used in the decision rule. Procedures intended to yield 
"good" guesses of the value of a parameter are said to be parameter 
estimates. Estimation procedures are the topic of the next chapter; there we 
will explore a variety of estimation techniques and develop measure of 


estimate quality. For the moment, these issues are secondary; even if we knew 
the size of the estimation error, for example, the more pertinent issue is how 
the imprecise parameter value affects the performance probabilities. We can 
compute these probabilities without explicitly determining the estimate's error 
characteristics. 


One parameter estimation procedure that fits nicely into the composite 
hypothesis testing problem is the maximum likelihood estimate. [footnote] 
Letting r denote the vector of observables and @ a vector of parameters, the 


maximum likelihood estimate of 0, Oz, is that value of 0 that maximizes the 
conditional density p,.\g (r) of the observations given the parameter values. 


To use Over, in our decision rule, we estimate the parameter vector separately 
for each model, use the estimated value in the conditional density of the 
observations, and compute the likelihood ratio. This procedure is termed the 
generalized likelihood ratio test for the unknown parameter problem in 
hypothesis testing (Lehmann; p.16), (van Trees;_p.92ff). 

Equation: 


maxg {6, Pr|.“40 (r)} 


A(r) = 
( ) maxg {6, Pr|.40 (r)} 


Note that we do not find that value of the parameter that (necessarily) 
maximizes the likelihood ratio. Rather, we estimate the parameter value most 
consistent with the observed data in the context of each assumed model 
(hypothesis) of data generation. In this way, the estimate conforms with each 
potential model rather than being determined by some amalgam of supposedly 
mutually exclusive models. 

The maximum likelihood estimation procedure and its characteristics are fully 
described in this section. 


Example: 
Returning to our Gaussian example, assume that the variance a” is known but 
that the mean under .4; is unknown. 


My: r~ (0, 07) 


The unknown quantity occurs only in the exponent of the conditional density 
under .@,; to maximize this density, we need only to maximize the exponent. 
Thus, we consider the derivative of the exponent with respect to m. 
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The solution of this equation is the average value of the observations 


To derive the decision rule, we substitute this estimate in the conditional 
density for 4. The critical term, the exponent of this density, is manipulated 
to obtain 
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Noting that the first term in this exponent is identical to the exponent of the 
denominator in the likelihood ratio, the generalized likelihood ratio becomes 


Maem ae) 


The sufficient statistic thus becomes the square (or equivalently the 
magnitude) of the summed observations. Compare this result with that 
obtained in [link]. There, an UMP test existed if we knew the sign of m and 
the sufficient statistic was the sum of the observations. Here, where we 
employed the generalized likelihood ratio test, we made no such assumptions 


about ™; this generality accounts for the difference in sufficient statistic. 
Which test do you think would lead to a greater detection probability for a 
given false-alarm probability? 


Once the generalized likelihood ratio is determined, we need to determine the 
threshold. If the a priori probabilities 79 and 7, are known, the evaluation of 
the threshold proceeds in the usual way. If they are not known, all of the 
conditional densities must not depend on the unknown parameters lest the 
performance probabilities also depend upon them. In most cases, the original 
model evaluation problem is posed in such a way that one of the models does 
not depend on the unknown parameter; a criterion on the performance 
probability related to that model can then be established via the Neyman- 
Pearson procedure. If not the case, the threshold cannot be computed and the 
threshold must be set experimentally: we force one of the models to be true 
and modify the threshold on the sufficient statistic until the desired level of 
performance is reached. Despite this non-mathematical approach, the overall 
performance of the model evaluation procedure will be optimum because of 


the results surrounding the Neyman-Pearson criterion. 
I-p 
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The two-model 
testing problem 
can be abstractly 
described as a 
communication 
channel where the 
inputs are the 
models and the 
outputs are the 
decisions. The 
transition 


probabilities are 
related to the 
false-alarm (Pr) 
and detection (Pp 
) probabilities. 


Spectral Detection 


From the results presented in the previous sections, the colored noise problem was found to be 
pervasive, but required a computationally difficult detector. The simplest detector structure 
occurs when the additive noise is white; this notion leads to the idea of whitening the 
observations, thereby transforming the data into a simpler form (as far as detection theory is 
concerned). However, the required whitening filter is often time-varying and can have a long- 
duration unit-sample response. Other, more computationally expedient, approaches to whitening 
are worth considering. An only slightly more complicated detection problem occurs when we 
have a diagonal noise covariance matrix, as in the white noise case, but unequal values on the 
diagonal. In terms of the observations, this situation means that they are contaminated by noise 
having statistically independent, but unequal variance components: the noise would thus be non- 
stationary. Few problems fall directly into this category; however, the colored noise problem can 
be recast into the white, unequal-variance problem by calculating the discrete Fourier Transform 
(DFT) of the observations and basing the detector on the resulting spectrum. The resulting 
spectral detectors greatly simplify detector structures for discrete-time problems if the 
qualifying assumptions described in sequel hold. 


Let W be the so-called Z x LZ "DEFT matrix" 
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where W is the elementary complex exponential e~ (7), The discrete Fourier Transform of the 
sequence r(J), usually written as R(k) = S0/G) r(le~ (‘**) , can be written in matrix form as 


R = Wr. To analyze the effect of evaluating the DFT of the observations, we describe the 
computations in matrix form for analytic simplicity. The first critical assumption has been made: 
take special note that the length of the transform equals the duration of the observations. In 
many signal processing applications, the transform length can differ from the data length, being 
either longer or shorter. The statistical properties developed in the following discussion are 
critically sensitive to the equality of these lengths. The covariance matrix Kr of R is given by 
WK,w#. Symmetries of these matrices - the Vandermonde form of W and the Hermitian, 
Toeplitz form of K,. - leads to many simplifications in evaluating this product. The entries no 
the main diagonal are given by [footnote] 
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The variance of the k*” term in the discrete Fourier Transform of the noise thus equals the 
discrete Fourier Transform of the windowed covariance function. This window has a triangular 
shape; colloquially termed the "rooftop" window, its technical name is the Bartlett window and 


it occurs frequently in array processing and spectral estimation. We have found that the variance 
equals the smoothed noise power spectrum evaluated at a particular frequency. The off-diagonal 
terms of Kp are not easily written; the complicated result is 

Equation: 


4 kimkot] ory ( tka ke) 

1 sin( Tr ( Hee )) 
. 1(ki—k) 
sin( HD) 


The complex exponential terms indicate that each off-diagonal term consists of the sum of two 
Fourier Transforms: one at the frequency index kz and the other negative index —k,. In 
addition, the transform is evaluated only over non-negative lags. The transformed quantity again 
equals a windowed version of the noise covariance function, but with a sinusoidal window 
whose frequency depends on the indices k; and ky. This window can be negative-valued! In 
contrast to the Bartlett window encountered in evaluating the on-diagonal terms, the maximum 
value achieved by the window is not large ED compared to ZL). Furthermore, this 
sin eo 
window is always zero at the origin, the location of the maximum value of any covariance 
function. The largest magnitudes of the off-diagonal terms tend to occur when the indices ky 
and kg are nearly equal. Let their difference be one; if the covariance function of the noise tends 
toward zero well within the number of observations, L, then the Bartlett window has little effect 
on the covariance function while the sinusoidal window greatly reduces it. This condition on the 
covariance function can be interpreted physically: the noise in this case is wideband and any 
correlation between noise values does not extend over significant portion of the observation 
record. On the other hand, if the width of the covariance function is comparable to J, the off- 
diagonal terms will be significant. This situation occurs when the noise bandwidth is smaller 
than or comparable to the reciprocal of the observation interval's duration. This condition on the 
duration of the observation interval relative to the width of the noise correlation function forms 
the second critical assumption of spectral detection. The off-diagonal terms will thus be much 
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The curious index / + 1 on the matrix arises because rows and columns of matrices are 
traditionally indexed beginning with one instead of zero. 


In the simplest case, the covariance matrix of the discrete Fourier Transform of the observations 
can be well approximated by a diagonal matrix. 
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The non-zero components o;” of this matrix constitute the noise power spectrum at the various 
frequencies. The signal component of the transformed observations FR is represented by S;, the 


DFT of the signal s;, while the noise component has this diagonal covariance matrix structure. 


Note: In the frequency domain, the colored noise problem can be approximately converted to a 
white noise problem where the components of the noise have unequal variances. 


To recap, the critical assumptions of spectral detection are 


¢ The transform length equals that of the observations. In particular, the observations cannot 
be "padded" to force the transform length to equal a "nice" number (like a power of two). 

¢ The noise's correlation structure should be much less than the duration of the observations. 
Equivalently, a narrow correlation function means the corresponding power spectrum 
varies slowly with frequency. If either condition fails to hold, calculating the Fourier 
Transform of the observations will not necessarily yield a simpler noise covariance matrix. 


The optimum spectral detector computes, for each possible signal, the quantity 
H -1 
R(REK pr Si) = Soka Ss [footnote]. Because of the covariance matrix's simple form, this 


sufficient statistic for the spectral detection problem has the simple form 
Equation: 
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Each term in the dot product between the discrete Fourier Transform of the observations and the 
signal is weighted by the reciprocal of the noise power spectrum at that frequency. This 
computation is much simpler than the equivalent time domain version and, because of 
algorithms such as the fast Fourier Transform, the initial transformation (the multiplication by 
W or the discrete Fourier Transform) can be evaluated expeditiously. 

The real part in the statistic emerges because R and S; are complex quantities. 


Sinusoidal signals are particularly well-suited to the spectral detection approach. If the signal's 
frequency equals one of the analysis frequencies in the Fourier Transform (wo = — for some 
k), then the sequence $;(k) is non-zero only at this frequency index, only one term in the 
sufficient statistic's summation need be computed, and the noise power is no longer explicitly 
needed by the detector (it can be merged into the threshold). 
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If the signal's frequency does not correspond to one of the analysis frequencies, spectral energy 
will be maximal at the nearest analysis frequency but will extend to nearby frequencies also. 
This effect is termed "leakage" and has been well studied. Exact formulation of the signal's DFT 


is usually complicated in this case; approximations which utilize only the maximal-energy 
frequency component will be sub-optimal (i.e., yield a smaller detection probability). The 
performance reduction may be small, however, justifying the reduced amount of computation. 


Introduction to Estimation Theory 


In searching for methods of extracting information from noisy observations, 
this chapter describes estimation theory, which has the goal of extracting 
from noise-corrupted observations the values of disturbance 
parameters (noise variance, for example), signal parameters (amplitude 
or propagation direction), or signal waveforms. Estimation theory 
assumes that the observations contain an information-bearing quantity, 
thereby tacitly assuming that detection-based preprocessing has been 
performed (in other words, do I have something in the observations worth 
estimating?). Conversely, detection theory often requires estimation of 
unknown parameters: Signal presence is assumed, parameter estimates are 
incorporated into the detection statistic, and consistency of observations and 
assumptions tested. Consequently, detection and estimation theory form a 
symbiotic relationship, each requiring the other to yield high-quality signal 
processing algorithms. 


Despite a wide variety of error criteria and problem frameworks, the 
optimal detector is characterized by a single result: the likelihood ratio test. 
Surprisingly, optimal detectors thus derived are usually easy to implement, 
not often requiring simplification to obtain a feasible realization in 
hardware or software. In contrast to detection theory, no fundamental result 
in estimation theory exists to be summoned to attack the problem at hand. 
The choice of error criterion and its optimization heavily influences the 
form of the estimation procedure. Because of the variety of criterion- 
dependent estimators, arguments frequently rage about which of several 
optimal estimators is "better." Each procedure is optimum for its assumed 
error criterion; thus, the argument becomes which error criterion best 
describes some intuitive notion of quality. When more ad hoc, noncriterion- 
based procedures[footnote] are used, we cannot assess the quality of the 
resulting estimator relative to the best achievable. As shown later, bounds 
on the estimation error do exist, but their tightness and applicability to a 
given situation are always issues in assessing estimator quality. At best, 
estimation theory is less structured than detection theory. Detection is 
science, estimation art. Inventiveness coupled with an understanding of the 
problem (what types of errors are critically important, for example) are key 


elements to deciding which estimation procedure "fits" a given problem 
well. 
This governmentese phrase concisely means guessing. 


Terminology in Estimation Theory 


More so than detection theory, estimation theory relies on jargon to 
characterize the properties of estimators. Without knowing any estimation 
technique, let's use parameter estimation as our discussion prototype. The 
parameter estimation problem is to determine from a set of L observations, 
represented by the L-dimensional vector 7, the values of parameters 
denoted by the vector 8. We write the estimate of this parameter vector as 


A(r), where the "hat" denotes the estimate, and the functional dependence 
on r explicitly denotes the dependence of the estimate on the observations. 
This dependence is always present[ footnote], but we frequently denote the 
estimate compactly as 6. Because of the probabilistic nature of the problems 
considered in this chapter, a parameter estimate is itself a random vector, 
having its own statistical characteristics. The estimation error ¢(7) equals 
the estimate minus the actual parameter value: e(r) = 0(r) — 6. It too isa 
random quantity and is often used in the criterion function. For example, 
the mean-squared error is given by & le*e| ; the minimum mean-squared 
error estimate would minimize this quantity. The mean-squared error matrix 
is EB lee"| ; on the main diagonal, its entries are the mean-squared 
estimation errors for each component of the parameter vector, whereas the 
off-diagonal terms express the correlation between the errors. The mean- 
squared estimation error / le*e| equals the trace of the mean-squared 
error matrix tr (E lee"| 1, 

Estimating the value of a parameter given no data may be an interesting 
problem in clairvoyance, but not in estimation theory. 


Bias 


An estimate is said to be unbiased if the expected value of the estimate 


equals the true value of the parameter: / 6 | 6| = 9. Otherwise, the 
estimate is said to be biased: 6 | 6| # 0. The bias b(6) is usually 
considered to be additive, so that b(0) = E 6 | 6| — 9. When we have a 


biased estimate, the bias usually depends on the number of observations L. 

An estimate is said to be asymptotically unbiased if the bias tends to zero 

for large L: limit b = 0. An estimate's variance equals the mean-squared 
LI-oo 


estimation error only if the estimate is unbiased. 


An unbiased estimate has a probability distribution where the mean equals 
the actual value of the parameter. Should the lack of bias be considered a 
desirable property? If many unbiased estimates are computed from 
statistically independent sets of observations having the same parameter 
value, the average of these estimates will be close to this value. This 
property does not mean that the estimate has less error than a biased one; 
there exist biased estimates whose mean-squared errors are smaller than 
unbiased ones. In such cases, the biased estimate is usually asymptotically 
unbiased. Lack of bias is good, but that is just one aspect of how we 
evaluate estimators. 


Consistency 


We term an estimate consistent if the mean-squared estimation error tends 
to zero as the number of observations becomes large: limit FE Ee =i0 
LI- oo 


Thus, a consistent estimate must be at least asymptotically unbiased. 
Unbiased estimates do exist whose errors never diminish as more data are 
collected: Their variances remain nonzero no matter how much data are 
available. Inconsistent estimates may provide reasonable estimates when 
the amount of data is limited, but have the counterintuitive property that the 
quality of the estimate does not improve as the number of observations 
increases. Although appropriate in the proper circumstances (smaller mean- 
squared error than a consistent estimate over a pertinent range of values of 
L, consistent estimates are usually favored in practice. 


Efficiency 


As estimators can be derived in a variety of ways, their error characteristics 
must always be analyzed and compared. In practice, many problems and the 
estimators derived for them are sufficiently complicated to render analytic 
studies of the errors difficult, if not impossible. Instead, numerical 
simulation and comparison with lower bounds on the estimation error are 
frequently used instead to assess the estimator performance. An efficient 
estimate has a mean-squared error that equals a particular lower bound: the 
Cramér-Rao bound. If an efficient estimate exists (the Cramér-Rao bound is 
the greatest lower bound), it is optimum in the mean-squared sense: No 
other estimate has a smaller mean-squared error (see Maximum Likelihood 
Estimators for details). 


For many problems no efficient estimate exists. In such cases, the Cramér- 
Rao bound remains a lower bound, but its value is smaller than that 
achievable by any estimator. How much smaller is usually not known. 
However, practitioners frequently use the Cramér-Rao bound in 
comparisons with numerical error calculations. Another issue is the choice 
of mean-squared error as the estimation criterion; it may not suffice to 
pointedly assess estimator performance in a particular problem. 
Nevertheless, every problem is usually subjected to a Cramér-Rao bound 
computation and the existence of an efficient estimate considered. 


Time-Delay Estimation 


An important signal parameter estimation problem is time-delay estimation. Here the unknown is the time origin of 
the signal: s(1,@) = s(l — 0). The duration of the signal (the domain over which the signal is defined) is assumed 
brief compared with the observation interval Z. Although in continuous time the signal delay is a continuous- 
valued variable, in discrete time it is not. Consequently, the maximum likelihood estimate cannot be found by 
differentiation, and we must determine the maximum likelihood estimate of signal delay by the most fundamental 
expression of the maximization procedure. Assuming Gaussian noise, the maximum likelihood estimate of delay is 
the solution of 


ming {6,(r — 5(6))"Kn* (r — 3(6))} 


The term s’ K,, ‘s is usually assumed not to vary with the presumed time origin of the signal because of the 
signal's short duration. If the noise is white, this term is constant except near the "edges" of the observation 
interval. If not white, the kernel of this quadratic form is equivalent to a whitening filter. As discussed later, this 
filter may be time varying. For noise spectra that are rational and have only poles, the whitening filter's unit- 
sample response varies only near the edges (see the example). Thus, near the edges, this quadratic form varies with 
presumed delay and the maximization is analytically difficult. Taking the "easy way out" by ignoring edge effects, 
the estimate is the solution of 


maxg {6, r’ K,,'8(6)} 
Thus, the delay estimate is the signal time origin that maximizes the matched filter's output. 


In addition to the complexity of finding the maximum likelihood estimate, the discrete-valued nature of the 
parameter also calls into question the use of the Cramér-Rao bound. One of the fundamental assumptions of the 
bound's derivation is the differentiability of the likelihood function with respect to the parameter. Mathematically, 
a sequence cannot be differentiated with respect to the integers. A sequence can be differentiated with respect to its 
argument if we consider the variable to be continuous valued. This approximation can be used only if the sampling 
interval, unity for the integers, is dense with respect to variations of the sequence. This condition means that the 
signal must be oversampled to apply the Cramér-Rao bound in a meaningful way. Under these conditions, the 
mean-squared estimation error for unbiased estimators can be no smaller than the Cramér-Rao bound, which is 
given by 
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which, in the white-noise case, becomes 
Equation: 
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Here, s/(-) denotes the "derivative" of the discrete-time signal. To justify using this Cramér-Rao bound, we must 
face the issue of whether an unbiased estimator for time delay exists. No general answer exists; each estimator, 
including the maximum likelihood one, must be examined individually. 


Example: 
Assume that the noise is white. Because of this assumption, we determine the time delay by maximizing the 
match-filtered observations. 


argmax oS r(l)s(l — 6) = Ova 
ll 


The number of terms in the sum equals the signal duration. [link] illustrates the match-filtered output in two 
separate situations; in one the signal has a relatively low-frequency spectrum as compared with the second. 
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Matchec-Filter Output 


The matched filter outputs are shown for two separate 
signal situations. In each case, the observation interval 
(100 samples), the signal's duration (50 samples) and 
energy (unity) are the same. The difference lies in the 
signal waveform; both are sinusoids with the first having 
a frequency of (277)0.04 and the second (27r)0.25. Each 
output is the signal's autocorrelation function. Few, broad 
peaks characterize the low-frequency example whereas 
many narrow peaks are found in the high frequency one. 


Because of the symmetry of the autocorrelation function, the estimate should be unbiased so long as the 
autocorrelation function is completely contained within the observation interval. Direct proof of this claim is left 
to the masochistic reader. For sinusoidal signals of energy and frequency wo, the Cramér-Rao bound is given by 


2 
E [e?] = ee z This bound on the error is accurate only if the measured maximum frequently occurs in the 
dominant peak of the signal's autocorrelation function. Otherwise, the maximum likelihood estimate "skips" a 

cycle and produces values concentrated near one of the smaller peaks. The interval between zero crossings of the 


dominant peak is sane the signal-to-noise ratio Sy must exceed 4 (about 0.5). Remember that this result 


implicitly assumed a low-frequency sinusoid. The second example demonstrates that cycle skipping occurs more 
frequently than this guideline suggests when a high-frequency sinusoid is used. 


The size of the errors encountered in the time-delay estimation problem can be more accurately assessed by a 
bounding technique tailored to the problem: the Ziv-Zakai bound (Wiess and Weinstein, Ziv and Zakai). The 
derivation of this bound relies on results from detection theory (Chazan, Zakai, and Ziv). [footnote] Consider the 
detection problem in which we must distinguish the signals s(J — 7) and s(1 — 7 + A) while observing them in 
the presence of white noise that is not necessarily Gaussian. Let hypothesis .@ represent the case in which the 
delay, denoted by our parameter symbol 9, is 7 and .@, the case in which 6 = 7 + A. The suboptimum test 
statistic consists of estimating the delay, then determining the closest a priori delay to the estimate. 
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By using this ad hoc hypothesis test as an essential part of the derivation, the bound can apply to many situations. 
Furthermore, by not restricting the type of parameter estimate, the bound applies to any estimator. The probability 
of error for the optimum hypothesis test (derived from the likelihood ratio) is denoted by P.(7, A). Assuming 
equally likely hypotheses, the probability of error resulting from the ad hoc test must be greater than that of the 
optimum. 


A A 
Plt, A) < 1/2Pr|e ard | Mo| + 1/2Pr|e a= | M| 
Here, € denotes the estimation error appropriate to the hypothesis. 


A 


_ 6—r if under.4% 

6—7r-—A if under.H7, 
The delay is assumed to range uniformly between 0 and LZ. Combining this restriction to the hypothesized delays 
yields bounds on both 7 and A: 0 < 7 < L — AandO < A < L. Simple manipulations show that the integral of 
this inequality with respect to 7 over the possible range of delays is given by [footnote] 


L-A L A 
/ Pi(r,A) dr <1/2 f Priel > > |e] az 
0 0 


Note that if we define ap (4) to be the right side of this equation so that 


~( A 1 f* A 
a(S)-7/ Pr|lel > 5] mo] ar 


P(-) is the complementary distribution function [footnote] of the magnitude of the average estimation error. 


Multiplying P( 4) by A and integrating, the result is 


The reason for these rather obscure manipulations is now revealed: Because P(-) is related to the probability 
distribution function of the absolute error, the right side of this equation is twice the mean-squared error £ [e?] ‘ 
The general Ziv-Zakai bound for the mean-squared estimation error of signal delay is thus expressed as 


1 L L-A 
Be]>> f 4 [ P.(r,A)drdA 
L 0 0 


In many cases, the optimum probability of error P.(7, A) does not depend on 7, the time origin of the 
observations. This lack of dependence is equivalent to ignoring edge effects and simplifies calculation of the 
bound. Thus, the Ziv-Zakai bound for time-delay estimation relates the mean-squared estimation error for delay to 
the probability of error incurred by the optimal detector that is deciding whether a nonzero delay is present or not. 
Equation: 


dA 


E(e”] > 4 f aw-ayp(ajaae = PL) [ (+ a") dP. 


2 3L/ dA 


To apply this bound to time-delay estimates (unbiased or not), the optimum probability of error for the type of 
noise and the relative delay between the two signals must be determined. Substituting this expression into either 
integral yields the Ziv-Zakai bound. 

This result is an example of detection and estimation theory complementing each other to advantage. 

Here again, the issue of the discrete nature of the delay becomes a consideration; this step in the derivation 
implicitly assumes that the delay is continuous valued. This approximation can be greeted more readily as it 
involves integration rather than differentiation (as in the Cramér-Rao bound). 

The complementary distribution function of a probability distribution function P(x) is defined to be 


P(z) = 1-— P(z), the probability that a random variable exceeds x. 


The general behavior of this bound at parameter extremes can be evaluated in some cases. Note that the Cramér- 
Rao bound in this problem approaches infinity as either the noise variance grows or the observation interval 
shrinks to 0 (either forces the signal-to-noise ratio to approach 0). This result is unrealistic as the actual delay is 
bounded, lying between 0 and L. In this very noisy situation, one should ignore the observations and "guess" any 
reasonable value for the delay; the estimation error is smaller. The probability of error approaches 1/2 in this 
situation no matter what the delay A may be. Considering the simplified form of the Ziv-Zakai bound, the integral 
in the second form is 0 in this extreme case. 
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The Ziv-Zakai bound is exactly the variance of a random variable uniformly distributed over [0, L — 1]. The Ziv- 
Zakai bound thus predicts the size of mean-squared errors more accurately than does the Cramér-Rao bound. 


Example: 
Let the noise be Gaussian of variance 0, and the signal have energy E. The probability of error resulting from 
the likelihood ratio test is given by 


E 
P(A) =Q / z (1 — (A)) 
20n 
The quantity p(A) is the normalized autocorrelation function of the signal evaluated at the delay A. 
AA) = Yi ss(l— 4) 


Evaluation of the Ziv-Zakai bound for a general signal is very difficult in this Gaussian noise case. Fortunately, 
the normalized autocorrelation function can be bounded by a relatively simple expression to yield a more 
manageable expression. The key quantity 1 — p(A) in the probability of error expression can be rewritten using 
Parseval's Theorem. 


L=p(A) = — i 2(|S(w)|)? x (1 — cos(wA)) dw 


Using the inequality 1 — cos(x) < x?, 1 — p(A) is bounded from above by min {4F, ae where ( is the root- 
mean-squared (RMS) signal bandwidth. 
Equation: 
»_ J7,* (|S) rdw 
J7,((S(w)|)° dw 


Because Q(-) is a decreasing function, we have P.(A) > Q(u min {A, A*}), where ys is a combination of all 


of the constants involved in the argument of Q(-): 4 = a This quantity varies with the product of the 
signal-to-noise ratio =, and the squared RMS bandwidth 6?. The parameter A” = 2 is known as the critical 


delay and is twice the reciprocal RMS bandwidth. We can use this lower bound for the probability of error in the 
Ziv-Zakai bound to produce a lower bound on the mean-squared estimation error. The integral in the first form of 
the bound yields the complicated, but computable result 


Ele*] > * alu min {1, A’\) Se oa Bs (u min joa |) a ( (1 at min {L? 


The quantity lie (-) is the probability distribution function of a x? random variable having three degrees of 
freedom.[footnote] Thus, the threshold effects in this expression for the mean-squared estimation error depend on 
the relation between the critical delay and the signal duration. In most cases, the minimum equals the critical 
delay A“, with the opposite choice possible for very low bandwidth signals. 

This distribution function has the "closed-form" expression P,2 = (1 —@Q (Vz) = af = e-*). 
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Ziv-Zakai Bound 


Mean-Squared Delay Estimation Error 


Signal-to-Noise Ratio (dB) 


The Ziv-Zakai bound and the Cramér- 
Rao bound for the estimation of the time 
delay of a signal observed in the 
presence of Gaussian noise is shown as 
a function of the signal-to-noise ratio. 
For this plot, L = 20 and 6 = (27)0.2. 
The Ziv-Zakai bound is much larger 
than the Cramér-Rao bound for signal- 
to-noise ratios less than 13 dB; the Ziv- 
Zakai bound can be as much as 30 times 
larger. 


The Ziv-Zakai bound and the Cramér-Rao bound for the time-delay estimation problem are shown in [link]. Note 
how the Ziv-Zakai bound matches the Cramér-Rao bound only for large signal-to-noise ratios, where they both 


— ae 


equal 1/ Vi = Ee For smaller values, the former bound is much larger and provides a better indication of the 


size of the estimation errors. These errors are because of the "cycle skipping" phenomenon described earlier. The 
Ziv-Zakai bound describes them well, whereas the Cramér-Rao bound ignores them. 


Introduction to Active Sonar 


Active sonar processing involves detection, classification and localization. 
We need to detect if an echo is present, decide if the echo comes from a 
target or clutter, and determine the position and velocity of the target that 
formed the echo. These are the fundamental operations of an active sonar 
system. 


During the target echo formation process, other signals are generated by the 
transmitted sound. Sound will scatter and reflect off of the ocean bottom 
and surface causing reverberation and direct blast arrivals. Ambient noise, 
generated by distant shipping and waves breaking on the ocean surface will 
also be present. 


The direct blast corresponds to the sound energy that travels from the 
source to the receiver via a direct path or with only specular reflections 
from the surface and bottom. Specular reflection is the mirror like reflection 
from a surface, and means that the sound ray's angle of incidence and angle 
of departure are equal. The direct blast energy arrives as a set of discrete 
arrivals spaced out in time, corresponding to the number of surface and 
bottom bounces (or reflections) that occurred for each arrival. Each surface 
and bottom interaction involves some loss, so that arrivals that incur more 
surface and bottom reflections arrive later in time and are lower in 
amplitude. 


Reverberation corresponds to the diffuse scattering from the surface and 
bottom. When the sound energy interacts with either boundary, it scatters in 
all directions, with most of the energy following a specular reflection path. 
The non-specular scattering continues to propagate to the receiver and 
generates a continuous, overlapping set of signal replicas termed 
reverberation. Reverberation has a smooth, continuous power versus time 
profile. Direct blast arrivals are discrete. 


Echoes, direct blast and reverberation will become louder as the transmitted 
signal gets louder. 


Sonar Receiver Model 


Active Sonar Receiver Model 


Consider the sonar receiver processing chain shown below: 


el 7M! 
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The sonar array input is heterodyned, low-pass filtered, sampled and scaled 
to generate a discrete time set of samples of the noise plus signal. 


The input noise n(t)is a real-valued, wide sense stationary random process 

with power spectral density P,,(f). Because n(t)is wide sense stationary, 

the power spectral density and the autocorrelation function are related by 
oo . 

Finlay) pf Paalpie oe dt 
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We assume that n(t)is zero mean. A complex carrier e/”"**is applied to the 
random process n(t)resulting in a complex random process z(t). The mean 
value of z(t)is given by: 


E{z(t)} = E{n(t)e??*"*} = E{n(t)}es?t = 0 
The covariance of z(t)is given by 


E{z(t)z(s)} = e7?"f-9) E{n(t)n(s)} = e7?*f-5) R,, (t — s), 


which shows that z(t)is wide sense stationary as well. 


z(t)is passed through a band-pass filter to produce x(t). The frequency 
response of the band-pass filter is assumed to be low-pass with a bandwidth 
of B/2Hertz. That is we will assume that the filter transfer function 


Hppr(f)is given by: 


1,| f |< B/2 
H(f) = 
0,| f |> B/2 
The resulting x(t)is a wide sense stationary random process with zero- 
mean and power spectral density: 


No/2,| f |< B/2 


hee mee 


Pas(f) = { 


Where we have assumed that the bandwidth of the receiver is small relative 
to the center frequency of the signal we are trying to detect, B/ fy<<1. The 
power spectral density of x(t)can then be approximated by the power 
spectral density of the noise near fo: 


Pils) Po Fo) = No/2, | f= fo |< B/2 


If x(t) has a power spectral density given by Eq-1, then the autocorrelation 
function of x(t) becomes: 


B/2 
R(T) == E{ax(t)x(t as T)} = f Ao eJén i af = Ao sina 
—B/2 


Note that R,,(0) = a8 


Now if we choose a sampling interval At = 1/B; then the samples at kAt 
have an autocorrelation given by 


E{x(kAt)« (lAt)} = oe = BN 6, 


Hence x(kAt),k = 0,1,... is a discrete time, wide sense stationary, white 


For matched filtering applications, we scale the output of the Analog to 
Digital conversion process by At = 1/Bto conserve the signal energy over 


a time interval T. This creates the discrete time process y;, = 2(kAt)v At. 


To see this, consider that 


By Fie) ‘ah = fata dt = J Res(0) 0)dt = BEN 
And 


hoe t 2 k= ; 
Bf Si | x(k) P} = CE wf] (A) P} = CE BN At = BIN 


Active Sonar Detection in Ambient Noise 


Introduction 


In a sonar system, one is often searching for targets, e.g. submarines, mines, or fish. 
The sonar system gathers sounds from its acoustic sensors searching for either echoes 
or sounds emitted by the target. These received sounds are the observations, and for 
active sonar, are collected into sets of observations related to the time of transmission 
of the sonar waveform. The sounds related to the broadcast of a single transmission is 
called a ping history. These pings occur sequentially in time, so one naturally has a 
sequence of observations (sound recordings) indexed by the time that a ping was 
transmitted. 


The sonar system decision space includes hypotheses about the target’s presence, 
location, velocity and classification. The observations Y;, at ping k contain information 
about the sonar decision space, but are also influenced, and often dominated by other 
sounds, such as noise and reverberation. Echoes, noise and reverberation are 
significantly influenced by the propagation properties of the ocean. These 
environmental effects are important when making useful inferences about the target 
echoes that may or may not be present in the sonar ping history. 


In most sonar systems today, environmental interference effects, such as noise and 
reverberation, are treated as random variables. The sonar processing designer develops 
algorithms that make detections and estimates target states by assuming a Statistical 
model of the echo and interference, choosing environmental interference model 
parameters (amplitude, covariance, autocorrelation, etc.) and then computing a 
detection decision or state estimate. 


The environmental effects are usually estimated as part of the target state decision 
process, or the processing algorithm is constructed to be invariant to the environmental 
effects. 


The primary detection processing method for current active sonars is to process the 
ping history with a bank of matched filters. The filters are constructed so that each 
filter is constructing the cross-correlation between the transmitted waveform and the 
pre-whitened ping history. 


A monostatic sonar has the source and receiver in the same location, and hence the 
receiver cannot realistically capture the ping history until the waveform transmission is 
complete. To simplify the problem, we will look for targets that are far away from the 
sonar, so that the echo reception occurs when the reverberation level has fallen below 
the background noise level. In this way, we are dealing with target echoes that are 
essentially embedded in background noise only. 


Decision Space 


We use the symbol ¢ to designate the target absent hypothesis. The other hypotheses 
concern the location of a single target. We then use as a decision space the composite 
space D = y U H' where His the space constructed from all target present 
hypotheses. The target hypothesis space consists of those locations around the sonar 
that generate echoes embedded in noise only. These location hypothesis h are part of 
the decision space h = (x,y) € H’specified by the set of ordered pairs h = (x,y) 
such that 


h= (z,y) © H7if Rnin < JV (a tr)? + (y— Yr)? < Rmax 


Where we are assuming that the depth of the target is small when compared to its (x,y) 
coordinates, the receiver is located at (x,,y,). Rminis the range at which the echo is 
noise, not reverberation limited, and Ryaxis the farthest range of interest. For this 
problem, / is an index into the target range from the sonar. 


Sonar Receiver Model 


The sonar transmits the waveform m(t)for each ping. In most sonar transmitters, the 
transmitted waveform is narrow-band, that is, the waveform bandwidth is much smaller 
than its center frequency, f. This is true because efficient sonar transmitters use 
resonant mechanical and electrical components to provide maximum electrical to sound 
power transfer. An approximation therefore is to model the transmitted waveform as an 
amplitude modulated carrier: 


m(t) = sin(2nft)w(t), t = (0,7) 


We will assume that the target is motionless, so that Doppler effects can be ignored. We 
will assume that the sonar receiver is a single sensor, with no directionality 
characteristics. For each target location hypothesis h = (x,y) we know approximately 
the received echo time series: 


g(t | h) = Bm(t — 2R/c) 


The amplitude B is related to the propagation loss out to the target hypothesis location, 
and the reflection characteristics of the target. The time delay 2R/c corresponds to the 
time it takes for the transmission waveform to reach the target and return to the sonar. 
R is the range to the target and c is the effective speed of sound, when including 
refraction and boundary reflections. 


The received echo is band-limited to approximately the same frequency band as the 
transmission. The receiver bandwidth may be greater than the transmitted bandwidth 
due to Doppler frequency shifts, but for the present, we are assuming that the target is 
not moving. Sonar receivers use heterodyne techniques to reduce the data storage of the 
ping history. The sonar receiver multiplies the ping history by a carrier signal e~/?"!*to 
shift the positive frequency part of the received echo closer to DC. The resulting signal 
is then low pass filtered to eliminate the shifted negative frequency part of the ping 
history. Since the original ping history was real, the negative frequency part of the 
signal spectra carries no additional information. The result is a complex signal with a 
lower bandwidth, but retains all of the echo related information of the original ping 
history. This heterodyne process can be done in the analog or digital domain. 


A target echo passing through the heterodyne part of the sonar receiver becomes: 
r(t | h) = Ae” w(t — 2R/c) 


The phase shift @ corresponds to the phase shift due to heterodyne operation; the 
uncertainty in propagation conditions; and the summation of multi-path arrivals with 
almost the same time delay, etc. 


We will assume that the target echo amplitude, Ae” jis a complex Gaussian random 
variable with zero mean and with standard deviation a”(h).We are modeling the echo 
as having the same waveform as the transmission, but with an uncertain phase and 
amplitude. This is assuming that the target echo amplitude obeys Swerling target type I 
statistics with unknown phase. 


While receiving an echo, we will also receive ambient noise, g(t), which we will 
assume to be complex Gaussian noise, with constant power spectral density over the 
receiver’s bandwidth. The noise power spectral density over the receiver’s bandwidth 
BW is assumed to be NoPascals\2/Hertz. 


We receive L complex valued samples, y as the ping history after heterodyning. For 
hypothesis h, the observation in discrete time is: 


y(kAt) = q(kAt) + r(kAt — 2R/c|h)for,k=1,...0 


where Att is the digital sample rate after heterodyning, and q is a sample of the noise 
and reverberation. Note that r(kAt — 2R(h)/c | h) = 0 when 2R(h)/c > kAt, 
because the echo is delayed. The delay for hypothesis h in samples, is given by 


D(h) = [24 


| 


where [x] is the nearest integer to x. We choose the sample rate At to be small enough 
to satisfy the Nyquist sampling criteria for the received echo. We will assume that the 
non-zero part of the echo is N samples long. 


Statistical Model of the Ping History 


We will represent the sampled echo response as a partitioned vector: 


On(h) wt 
00) ee Aw ; 
Or_p(n)-n 
Where 
w(At) 
w= : 
w(NAt) 


and the sampled noise and interference as a vector 


q1 


qL 


so that the sampled ping history becomes 

y= q+tr(h) 

The echo is modeled as a known signal w, with Gaussian random complex amplitude 
A, with zero mean and variance on h)* We will assume that ww = 1, and that 

| A(h) |? is the energy of the echo, with units Pascals\2-seconds. Since o74(p)i8 


E| A(h) |’, ithas units of Pascals\2-seconds as well. The amplitude of the echo is a 
function of the target location hypothesis h. The location of win r(h) depends on the 
location of the target through the time delay D(h). 


Since each element of the random vector Aw is complex Gaussian, the random vector 
Awhas a complex Gaussian distribution. The probability density of Aw is Gaussian 
zero mean with covariance matrix on wwe . To see this, consider that 


E(Aw) = E(A)w = 0n 
The covariance of Aw is given by: 
E(Aw)(Aw)# = E(AA")ww® = o4(nyWW 


hence r(h)is zero mean complex Gaussian with covariance matrix on ptt 


For the clutter only hypothesis y,y = q. 


We have sampled, heterodyned and possibly re-sampled the noise process g(t)to form q 


During the period where r is non-zero, gis a sampled version of the ambient noise, 
represented as a N by 1 complex Gaussian noise random vector with zero mean and 
covariance matrix (No)Iy. This is true because BW At ~ 1 for complex Nyquist 
sampling of a band-limited signal. 


Overall, the noise and reverberation g is assumed to be complex Gaussian with zero 
mean and L by L covariance matrix C. 


Because we are assuming that the reverberation dies away before the echoes from the 
target search arrive, C’ has the following partition: 


R 0 
c=, al 


Matrix R has dimensions of DinxD min, the minimum delay where the echo 
interference is dominated by Ambient noise. 


Under target hypothesis h , y is Gaussian with has zero mean and covariance matrix 
C+ogerr®. 


The probability density of yunder h becomes: 


p(y | h) = sraeyaaey exP(—y7(C, + C)y), 


where C;, = o gorr®. 


Under the clutter hypothesis, y, y has zero mean and covariance matrix C’.The 
probability density of yunder y becomes: 


P(y | 9) = sraerey exP(-¥"(C)*y) 


Active Sonar Likelihood Function 


We will find that the natural logarithm of the measurement likelihood ratio simplifies 
the detection expression considerably: 


log ae = log(det(C)) — log(det(C + C,)) + y# (C1 — (C, + C)*)y 


The determinant of Cis det(C) = det(NoIz_p,,,, )det(R) = (No)*-?™det(R) 


It is convenient to partition C + C;,into sub-matrices compatible with r. 


R 0 0 0 

0 Nolpao-tin 0 0 
ee i 0 Noly + oq) ww" 0 

0 0 0 Nolt—p(h)—n 


The determinant of C' + C,is 

det(C + C,) = det(R)det(Nolp(n)—pyi,)det(NoLn + o%4 tt det (Nolz—pin)-w) 
Which becomes 

det(C + C,) = (No)2~?P=Ndet(Noly + ont )det(R) 

Using Sylvester’s determinant theorem: 

det(J + AB) = det(J + BA) 


We obtain: 


oan) eT 
No 


det(C + C,) = (No)*~? == (1 + )det(R) 
Which equals 
det(C + C,) = (No) P= (1 + 4" )det(R) 


Now we partition the observed ping history y into 


YN1 
Y= 
V’h 
YN2 
so that: 
0 0 0 0 
7 é oy 0 0 0 0 
SY 65 win — (Noln + 0%, ww") 
0 0 0 0 


The Woodbury matrix identity states: 
(A+UCV)1 = At ATU(C + VATU) VA™ 
So that 


oo 


(Nolw + oan) WW 


And hence 


= 7 o)/No 2 
YE (C8 = (C+) y = fy | att 


So that the log likelihood function becomes 
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Putting the noise variance inside the observation vector yields for the log likelihood 
function 


N 
Aqy/No | ay Hye is 


L(y|h) 
logsaiay = log(1 + 24% Wy 4 = a4 ht war 
No 


The log- Reanooe ratio shows that the assumed echo shape, w# , the assumed energy 
of the signal a, ACh)’ the ambient noise density No, and the expected location in time of 


the echo needed to select y;,are the parameters required to evaluate the log-likelihood 


function. The assumed echo shape and energy can vary by hypothesis h, but the noise 
properties Nohave been assumed to be the same for each decision hypothesis h. 


Signal to Noise Ratio of the Detector 


Ay os ; : 
The magnitude squared term, | wt vin |, is a matched filter, where the observations 
0 


are cross-correlated with the signal template. The observations are normalized by the 
noise variance before cross correlation, which is a form of pre-whitening. The term 


2 
oO 
Se is the Energy to Noise Density Ratio (ENR) of the detection problem. Note that 
the signal to noise ratio of the matched filter output can be written as: 
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This result is a general result for matched filters. A matched filter’s SNR is the Energy 
to Noise Density Ratio of the problem. The energy of a signal being detected is related 
to the average amplitude and the duration of the signal. The SNR output of the matched 
filter is independent of the details of the waveform being detected, only the signal 
energy and the noise spectral density determine the matched filter response. 


The likelihood function is dependent only the ENR as well. 


A-priori Assumptions 


Before the ping history is received, we assess the probability of each hypothesis, p(h) 
and p(y). The a-priori information may have come from previous pings, or are 
probabilities assigned by the sonar system to begin a target search. 


Using the logarithm of probability density function ratios simplifies the expressions: 


PRY) Jog Eh) 


p (h) 
AGI) Laas op 


p(y) 


log 
Using the likelihood ratio notation, 


h i 
logA(h | y) = log 74. + log A” (h) 


Once we compute logA(h | y), we can declare a target is present, with confidence pr 
by computing: 


Target Present if: { A(h)dh > —-- 
heHt 2 


Because the target hypothesis space contains many hypotheses, this detection problem 
is can be considered a composite hypothesis test. 


An alternative approach to detection of a target with an unknown range is solved by 
finding the target range hypothesis with the greatest measurement likelihood as the 
detection statistic [Kay]. This is referred to as Generalized Maximum Likelihood Ratio 
Testing (GLRT). 


L(y|h) 
L(y\y) 


Target Present if: maxlog > ry 


The GLRT approach is often easier to implement than the Bayes detection approach, 
because one avoids the integration/summation over a-priori probabilities. 


Properties of Active Sonar Matched Filtering 


Properties of Active Sonar Matched Filtering 


Introduction 


Matched filters are used extensively in coherent active sonar. The output of a matched filter is used for 
detection, classification and localization. This document develops some properties of matched filters, 
including the SNR response in ambient noise and the response to reverberation. 


In a matched filter for active sonar, we are integrating the echo plus interference times the echo’s replica. 
When an echo passes through the matched filter, we are cross-correlating the echo with a scaled version 
of the echo, so that the output is a scaled version of the auto-correlation of the echo corrupted by noise. 
The autocorrelation of the echo has a peak in time whose duration is approximately the inverse of the 
echo’s bandwidth. 


For some waveforms (such as the Sinusoidal Frequency Modulation pulse) the autocorrelation function 
will have multiple peaks, termed ‘fingers’, due to the periodic structure of the pulse. Each 
autocorrelation finger has a time width approximately equal to the signal’s bandwidth. 


Continuous Time Matched Filter 


The echo is written as 


e(t) = VEpr(t), where J r2(qae = 


UM 
This implies that the echo energy f e*(t)dtis Er, measured in Pascal/2-seconds. 
0 


We can write the matched filter operation in continuous time as 
t+T 


m(t)= f y(o)r(o— t)do 


t 


y(o)is the receiver time series. In response to a target echo that arrives at Tpseconds and without noise 
or reverberation, the receiver output is y(t) = e(t — Tp). The output of the matched filter becomes: 


iT t+T 
m(t)= f e(o—Tp)r(o—t)do = VER f[ r(o—Tp)r(o — t)do 
t t 
Hence m(Tp) = V Er. The peak power output of the matched filter, m(t), in response to a echo is 
Ep. 


We determine the matched filter response to noise next. Assume the input noise is white with variance 
ANo: 


E{n(t)n(s)} = ANod(t — s) 


Note that the delta function has units of inverse seconds, and therefore ANohas units of Pascals\2/Hz, 
equivalent to a spectral density. From the definition of stationary random process autocorrelations and 
power spectral density, we know that the Fourier transform of the autocorrelation is the spectral density 
function for the random process. The Fourier transform of covariance becomes 

fe?" ANgd(r)dr = ANo, which is the spectral density of the noise. 


t+T t+T 
E{m(t)m(s)} = Ey f n(o)r(o—t)do f n(B)r(B —t)dB > = 
t t 
Ar er t+T 
ANo f f 6(o- B)r(o — t)r(B— t)dodB = ANo f[ r?(o —t)do = ANo 
& “a t 
Thus, the noise power response of a matched filter is the input spectral density, ANo. 


We conclude that the signal to noise ratio (SNR) at the output of a matched filter is the ratio of the echo 
energy to the noise spectral density, Ez /ANo. This assumes that the noise is white, e.g. a flat spectral 
density at the input to the matched filter. This is a general result, independent of the signal waveform 
details, except for its energy Ez. 


Discrete Time Matched Filters 


Discrete time filters have nearly the same properties as continuous time filters. In discrete time, we 
assume an echo of e(k) = /Ear(k), with )>)_, r2(k) = 1. The discrete matched filter output to the 
input y(k) is given by: 


m(k) = ize y()r(l— k +1) 
In response to the echo, y(k) = e(k — Tp)the output of the discrete time matched filter is 
m(k) = ort, e(l—Tp)r(l— k +1) = VER r(l— Tp)r(l— +1) 


Hence m(T’p — 1) = Ep. The peak power output of the matched filter, m?(t), in response to a echo 
is Fp. 


We determine the discrete matched filter response to noise next. Assume the input noise is sampled 
white with variance ANo: 


E{n(k)n(l)} = ANodia 


E{m(k)m(p)} = BY ee (rl k+ 1) OE nla —p +) } = 
ANo Die ot bur(lL—k + U)r(i—p +1) = ANo Dey 2(1— k +1) = ANo 


Thus, the signal to noise ratio at the output of a discrete time matched filter is Ep /ANo. 


The matched filter compresses the echo signal to a pulse (or a series of pulses for waveforms such as 
SFM) with time width equal approximately to its inverse bandwidth, 1/BW. 


Matched Filter Response to Reverberation 


One model for reverberation assumes that the reverberation comes from distributed discrete scatterers, 
with density A(w). 


ult) = V Br f Au)P(u)r(t — 7(u)) au, 


A(u) is considered a random, spatial process that models the amplitude of the scattering that occurs at 
range u back to the receiver. We are assuming that the receiver has significant aperture, and that y(t) is 
the receiver response at the output of a beamformer. In this case, scattering is occurring from the patch 
of the ocean bottom or surface that lies at range u and within the receiver beamwidth in azimuth and 
elevation. Each patch of the bottom or surface will arrive at the receiver at a different time. I’(w)is the 
transmission loss from the source to the scattering range (u) and back to the receiver. 7(w)is the total 
travel time from source to scatterer to receiver. As one can see, the reverberation is made up of many 
time delayed and amplitude scaled replicas of the transmitted waveform. 


The matched filter response to the reverberation is 
ma(t) = f?*? o(¢ — 8) VEp i AWG =Fa)ads = 
JEx f A(u)P(u) f2*? r(o — r(u))r(o — ddodu 
0 
We define the transmitted waveform autocorrelation function as 


x(r) = fo r(t —7)r(t)dt 


Recall, that by definition, y(0) = f, - r*(t)dt = 1. In more general terms, we define the transmitted 
wideband signal ambiguity function as 


xwe(t.7) = V7 fo. 7 (n(t — 7))r(t)at 


Note: Some authors define the ambiguity function as the magnitude squared value of this definition. 
Other authors choose different normalizations or the sign (-/+) on the delay term T. 


In the wideband signal ambiguity function, the Doppler effect is represented by the scaling factor 7. In 
narrowband cases, the Doppler effect is represented by a frequency shift, y. For a monostatic sonar the 
frequency shift is given by yp = 2v/c, where vis the radial velocity between the scattering object and the 
sonar system. 


xnp(T,9) = f°. r(t —T)r(t)e 7?" dt 


One can show (Weiss) that the narrowband approximation to Doppler is valid if 2v/e<< TT where Bis 
the waveform bandwidth and Tis the duration. For one hundred (100) Hertz bandwidth waveforms that 


last for one (1) second, the speed of the target must be much less than 7.5 m/sec, or approximately 15 
knots. 


An important invariance property of the narrowband ambiguity function is that 


S| xa(7,9) Pat =1 


Using either definition of the signal ambiguity function we have 


fr" ro = r(u))r(o = t)do = fy r(o’ = (r(u) — t))r(o’)do’ = x(r(w) - 4,0) 


Therefore the matched filter response is 
mr(t) = VEr f A(u)D'(u)x(r(u) — t,0)du 
0 


This expression for the matched filter response shows that the amount of reverberation at time t is 
directly related to the transmitted signal’s autocorrelation and energy source level (ESL). Wide band 
signals will have narrower autocorrelation peaks, and thus less reverberation amplitude. 


Using the narrowband approximation for Doppler shifts allows efficient implementation of matched 
filter banks as generalized spectrogram analysis. One treats the replica as a “window function” in place 
of the more traditional Hanning or Hamming windows. The matched filter for narrowband Doppler 
shifts is given by: 


+7 
m(te) = f ylo)r(o— t)e-22"*do 
t 


Which can be rewritten as: 
a Co 2 Pa r , a 
m(t,p) = e?™ f y(t+oa)r(o je?" {7 do 
—oo 

The spectrogram with window function w(c) is given by: 

ee a , 4 i a 

f u(t + o°w(o'Je #2" da 
—oo 


(t,p) = ||" 


Hence, the squared envelope of the narrowband Doppler matched filter is a spectrogram with the 
window function being the conjugate of the transmitted waveform replica. This interpretation of 
narrowband matched filtering lends insight into the use of different window functions on transmitted 
waveforms. Often one adds a Hanning or Tukey window to the transmitted waveform. This windowing 
is necessary in some cases because the sonar transmitter cannot turn on and off instantly. 


In spectral analysis, windows are used to control ‘spectral leakage’, which occurs because of the finite 
time window used for frequency analysis. Spectral leakage generates sidelobes from strong tones that 
mask low amplitude tones at different frequencies. In a matched filter using Doppler resolving 
waveforms, reverberation will be much stronger near zero Doppler than at other Doppler frequencies. 
The waveform windows help keep Doppler sidelobes of reverberation from masking the lower amplitude 
target echoes that may occur at high Doppler. 


Using matched filters based on the narrowband approximation to process echoes with Doppler beyond 
the narrowband approximation limits will result in correlation loss. The correlation loss will result in a 
loss of Signal to Noise ratio for these echoes. 


This presents a fundamental design decision for a sonar system that needs to process echoes with 
Doppler on the order of 15 knots or greater. If one uses “narrowband” processing for efficiency, then one 
has to limit the waveforms to those that satisfy the narrowband approximation. However, as shown in the 
earlier sections, having a larger bandwidth will reduce the autocorrelation time of the waveform and thus 
reduce the response to reverberation. This reverberation versus bandwidth property advocates the use of 
wideband waveforms and hence, broadband matched filtering. There are, however, waveforms that have 
low correlation loss across all Doppler shifts. These are known as hyperbolic frequency modulation 
(HFM) waveforms. 


So, one can use narrowband processing, and restrict the waveforms to low bandwidth (1 Hertz say) 
waveforms such as a pulsed sine wave, and wideband waveforms with high Doppler tolerance, such as 
HFM. If one wants to use waveforms that have Doppler resolving power and high bandwidth, one needs 
to use broadband matched filtering. 


Doppler Sensitive Waveform Matched Filtering 


In some cases, one uses a signal that is Doppler sensitive, e.g. the signal ambiguity function .(7,7)is a 
strong function of the Doppler variable. Examples of these waveforms are CW, SFM and comb 
waveforms. Other waveforms are less sensitive to Doppler effects, beyond a time delay/Doppler 
coupling effect. 


In the cases where one is using Doppler sensitive waveforms, the matched filter is generalized to a 
matched filter bank, indexed by both time (range) and Doppler (7) 


t+T/n 
m(t,n) = /n f y(a)r(n(o — t))do 


This allows one to search for targets that have relative motion to the source and receivers of the active 
sonar. When the received signal is a Doppler scaled echo, then the filter that matches the echo Doppler 
will be a matched filter for that echo, and obey the same SNR properties for echoes embedded in noise 
as the earlier discussion for stationary targets echoes. The replica is matched to the echo compression 
and time delay. To ensure energy consistency, we scale the zero Doppler replica r(¢) by ,/7 when using 
it for other Doppler hypotheses. This comes from the fact that 

T/n 
fr? (nt)dt = 1/n 


0 


Now, using the reverberation model as before, we have the following expression for the matched filter 
bank response to reverberation: 


—_— °° t+T'/ 
ma(tsn) = VnEr f A(u)P(u) fp" ro — t(u))r(nlo — t))dodu 
0 
The integral including the replicas, using the change of variables 7 = o — t, can be written as 


Tj = i r(o + 6)r(no )do’, where 5 = t — r(u) 


We can extend the limits of the integral, because r(no )is zero outside the integration limits. Extending 
limits and changing variables too’ =o + dyields 


I5n = Jo. 7(o r(n(o" = 8))do” = x(5,n)/n 


Hence the range Doppler matched filter bank response to reverberation becomes 
ma(tyn) = VEr f A(u)D(u)x(t — 7(u),n)du 
0 


This expression shows that if the target has Doppler, and one uses a replica matched to the target 
Doppler, further suppression of reverberation is possible. This suppression of reverberation is without 
loss to matching the target echo or loss in noise limited performance. 


Choosing waveforms that are broadband, and with roll-off with respect to Doppler in its signal 
ambiguity function, will optimize the active sonar’s processing in reverberation. For some waveforms 
(such as the Sinusoidal Frequency Modulation pulse) the signal ambiguity function will have multiple 
time delay peaks, termed ‘fingers’, due to the periodic structure of the pulse. Each autocorrelation finger 
has a time width approximately equal to the signal’s bandwidth. This will increase the reverberation 
response, relative to a waveform with a single autocorrelation response. However, the SFM waveform 
has a roll-off in Doppler as well. So for targets that have Doppler, there can be a Doppler shift where the 
roll-off in Doppler more than compensates for the additional autocorrelation peaks. 


We will make the assumption that A(u) is wide sense stationary, that is its statistics are invariant over the 
range of u: 


E{A(u)A(v)} = Ra(u — v) 


Furthermore, we will assume that A(u) is spatially white, e.g. the scattering elements are uncorrelated 
with each other: 


E{A(u) A(v)} = Ra(u—v) = Rad(u — v) 


Now, these two assumptions, that the reflection coefficient statistics are independent of range and each 
differential patch is statistically independent of each other is only an approximation to the real situation. 
However, these approximations allow one to see the interaction of reverberation and waveform selection. 


E{m3(t,n)} = BA Br i A(u)I'(u)x(t — r(u)sn)du f A(y)I'(y)x(t — r(p)map} 
Rearranging, 

B{mi(tn)} = Er f f ELA) WE e)xX(t— rw) x(t — 1(9)a\dudy 
Using the covariance of the scattering elements we get, 

E{m2(t,n)} = Er i i Ryd(u— g)F(u)I'(y)x(t — r(u).n)x(t — 7(y),n)dudy 

Or, 


Cc 


E{m>(t,n)} = ErRa f | D(u) |?2(t — 7(u),n)du 


To see this more clearly, assume that the transmission loss term I"(u)is approximately constant over the 
transmitted signal’s correlation time and receiver’s beam pattern. Then we obtain 


B{mi(ta)} = BrRa| Pwo) Pf x2(t— r(u) naw 


Where uois defined by T(uo) = 

If we assume that the time delay varies smoothly with respect to range, we can replace the integration 
over u with an integration over time delay 7, where we assume that the chance of variable from u to Tis 
approximately given by 7 = 2u/c, where c is the speed of sound. This is assuming an approximate 


monostatic geometry, or that the patch of reverberation is far away relative to the source receiver 
separation. 


We then get 
E{m}(t,n)} = ErRa| (uo) |'¢/2 f x?(t — rn)dr 
0 


If we assume that the matched filter time t is greater than the signal duration T, then letting 7 = t — 7, 
we obtain 


E{m},(t.n)} = ErRa| (uo) 'e/2 f x2(¢—7n)dr 
We define the Q-function of the waveform as 


Q(n) = Et | x(7 ym) [ar 


Note that Q(7)has units of seconds2. We call a waveform with a sharp peak in Q(7)as a Doppler 
Sensitive Waveform (DSW). A sine wave pulse will have a sharp peak in Q(7) for instance. 


When the narrowband ambiguity function is used the Q function is normalized: 
i Qna(y)dy =f | xnp(r’,p) |'dr’dp =1 


The wideband waveform Q function is approximately normalized to unity. 


The reverberation response can be written as 


E{m2(t,n)} = ErRa| I'(uo) ?Q(n)e/2 


Clearly, the best waveform to use for detection depends on the assumed target velocity. Waveforms such 
as HFM and LFM have low Q-functions that are relatively constant across Doppler. Doppler sensitive 
waveforms often have lower Q-functions at higher Doppler shifts than LFM and HFM, much higher Q 
functions near zero Doppler. To best search for targets, one needs waveforms optimized for both low and 
high Doppler targets. 


So far, this has been a deterministic description of the matched filter response to reverberation. 


Channel Doppler Effects on Reverberation 


In reality, the reflection coefficient or the transmission loss term will be time varying (as well as spatially 
varying) because of the surface of the ocean having waves, and the internal thermal structure of the 
ocean channel will be time varying. 


For bottom reverberation, we will assume that the reflection coefficient is time invariant. In shallow 
water at low frequencies (< 2000 Hz, say) the bottom reverberation dominates over surface 

reverberation. However, the acoustic propagation through the sound channel and specular reflection 
from the ocean surface introduces a time varying component to the reverberation formation process. 


To derive the results needed for channel Doppler effects, we will restrict ourselves to the narrowband 
model. 


The matched filter is given by: 
t+T 
m(te) = f y(o)r(o— te#2"*%da 
t 
Since r(t) = Ofort < Oand t > T, we extend the limits of integration for the matched filter response to: 


m(tye) = J ulo)r(o—t)e- "de 


We define the effects of reverberation, targets, clutter and the acoustic channel, via a spreading function 
S(7,y) acting on the transmitted waveform: 


uit) =VEx ff S(rwp)el*™r(t— randy 


This expression does not include contributions of ambient noise, only scattering phenomena. The 
spreading function S(7,y)defines the acoustic scattering, as a function of delay rand Doppler shift y for 
the sonar reception. The spreading function is a random variable, changing due to surface waves and 
time varying refraction effects (internal waves) in the sound channel. 


Target echoes will have a small vregion of non-zero spreading function, STarget(7T,~). Reverberation 
will have an extended 7 region with significant SReverb (7,9). The Doppler shift for both reverberation 
and targets will be related to receiver and source motion, as well as Doppler spreading due to surface and 
internal waves. The target will have additional Doppler contributions from its own motion. 


Substituting the spreading function description to the sonar response into the matched filter we obtain 


S(7,5)e3?"°r(ao — r)r(o — the 7?" 'dadrdé 


mie) =VEr ff 


B38 


Which equals 


m(t,p)=VEp f ff S(rd)eH"e-Dep(g — r)r(o — t)dodrdé 


—co —cCO — 


Letting 0’ = o — t, we obtain 


m(t,p)=VEp f ff S(r,d)e~#(e-I0'+)r(o')r(o’ — (t — 7) )dodrdé 


Using the definition of the narrowband ambiguity function, the matched filter response becomes 
m(t,yp)=VEr f f S(7,d)e-?™"-9) y(t — 7,9 — 5)drdd 


The response of the matched filter is a “twisted convolution” of the spreading function and the 
waveform ambiguity function. The exponential e~?"°—®) performs the twisting. Note that if the 
waveform ambiguity function was “perfect”, that is a single peak, 


x(7,") = 5(7)d(~) 


Then the matched filter response would become: 


m(t,y) a VETS(t,¢) ae n(t,p) 


Where n(t,~) is the response of the matched filter to ambient noise. In this sense, the matched filter is 
estimating the spreading function of the channel, with targets, clutter and reverberation all part of the 
spreading function. Note, however that x(0,0) = 1, so the ambiguity function cannot become a delta 
function. 


Now, the power output of the matched filter is desired, so that Signal to Interference Ratios and similar 
quantities can be predicted. We will make statistical assumptions about the spreading function. The 
assumptions are that the spreading function is wide sense stationary and uncorrelated. This implies that 
the signals being processed are statistically stationary and that the scatterers are uncorrelated; so that 
(Van Trees, II, Ch 13): 


E{S(r,~)S(r’,p')} = Rsg(r,)4(7 — r')8(p — v’) 


Rgg(7,y) is known as the scattering function of the active sonar scenario. The description of the target, 
reverberation and clutter statistics are captured in this expression. 


Using this definition, we obtain for the power of the matched filter: 


E{m(t,p)m(t,p)} = 
Er f f S(r,d)e""-9 y(t — r,p — d)drdd ff S(r’,6 )et?*™e-8)" y(t — 7,9 — 8 )dr'dd’ 


Which becomes 


B{| m(ty) ?}=Br ff Rss(r.)| xt 7.9 —8) Parad 


—co —ocoO 


Using two dimensional convolution notation **, this expression for the matched filter power P(t,y) 
becomes 


P(t,p) = B{| m(t,y) |} = Bp Rss(t.y)™*| x(t) [ 


Now, let us model the acoustic sonar problem as target, clutter/reverberation and noise. The matched 
filter power response to ambient noise was shown earlier to be ANo. The overall signal to interference 
ratio, for a target at range tand Doppler y is 


Prarget (tp) 
I = arget \l, 
S R PReverb/clutter (t,~) +ANo 


Which becomes, 


SIR(t,~) = 


2, AN 
x(t9)|"+ =r 


DV 
* 
—* 


This expression can be simplified for different target and scattering conditions. For a point target at 
range tgand Doppler Yo, the target scattering function becomes 


Ra (t,) = Sd(t — to )d(y az £0). 


Often, we can assume that the reverberation scattering function is constant in the vicinity of the target, so 
that 


Vv Cc 
Re erb/ Intter (4 ip) = Ri, Qrevro(¥) 


Qrevrb(y) describes the Doppler roll-off of the reverberation. It will be affected by the source, receiver 
and ocean motion. In this characterization, we are ignoring “discrete clutter”, e.g. target like responses 
from bottom features. 


We will assume that QRevrb(y)is normalized, so that: 
f Qrevin(y)dy =1 
With these assumptions we obtain 


STB (ty, 90) |x(t-to,e—Yo)” 
SIR(t = 
( #) R,, Sf Qnevern(¥-d)|x(7,9)|"drdb+ 2 


Note that when the matched filter is matched in time and Doppler, then t = t)and y = Yo, and the 
numerator is maximized: 


ST (too) 


SIR(t Faia a (archi eAraeas aD 
(to,0) R,, IS Qneveri (P0—8)|x(7 90) ?drdd-+ 52 


The denominator can be simplified further by using the definition of the waveform Q function: 


Q(¥) = Sl x(z.) [Par 


Using this definition, we obtain: 


STareet (ty (po) |x (t—to,e—Yo) |” 


AN 
R,, QReverb(y) *Q(y~)+ zr 


SIR(t,p) = 


When using a waveform with a Q function much wider than the reverberation Q function, (such as an 
HFM), the waveform Q function can be replaced by a constant, such as Qyrm. The convolution with the 


reverberation Q function becomes the constant Qyrm (because of the normalization of QReverb) : 


Qrevers(Y) * Q(~) = Quru f Qrevers(y — ¢ dp = Quem 


With this approximation the signal to interference ratio becomes: 


STB (ty Wo) |x(t—to,~—¥o) |” 
SIRupa(t,p) = “Geto ye voll AN, a 
R,,Qurmt+ Fz 


When the waveform is Doppler sensitive, and it’s Q function is narrower than the Reverberation Q 
function, we can approximate the waveform Q function by a rectangle of height T and width 1/T 
centered at zero Doppler. Than the convolution becomes: 


1/2T 


QReverb(Y) * Q(¢) = f QrReverb(Y 3 » )Qpsw(y )dy’ x Ue Qreverb(Y = y )Tdy x QReverb(Y) 


Therefore, the signal to interference ratio becomes 


STarset (ty Wo) |x(t—to,—Go) |” 
SIRpsw(t,) = 2 ee to vod 
(69) R,, Qrevero (9) + pe 


We see that the best waveform for enhancing signal to interference ratio depends on the environmental Q 
function, and the assumed target Doppler. 


The Active Sonar Equation 


The active sonar equation expresses the signal excess (SE) which is the part of the target signal to noise 
ratio that exceeds the sonar’s detection threshold (DT). In decibel quantities, it is given by: 


SE = ESL + TS — TLsr — TLrr — (RLpSANo) — DT 


We are assuming that the active sonar uses a matched filter for detection. In the sonar equation, the 
transmitted energy signal level (ESL) is the sound pressure squared and integrated over the transmitted 
pulse length. The energy of the received echo (known as the echo energy level) is 

10*log10(Er) = ESL + TS — TLgr — TLrr. Note that the echo energy level can be computed using 
a received echo pulse length, which due to sound channel dispersion can be longer than the transmitted 
pulse. 


Using the properties of matched filters, the matched filter generates an output due to the target echo with 
peak power level of ESL + TS — TLhgr — TL rr. 


ANois the ambient noise level in a 1 Hz band, and is assumed to be constant across the matched filter’s 
bandwidth, e.g. it is twice the spectral density of the ambient noise. ANo and spectral density have units 
of Pa” /Hz, which is an energy quantity. 


The matched filter noise output is RLo@ANpo. The operation @ corresponds to power addition. Power 
addition converts the quantities back to units of power (Pa/2, volts/2, etc), adds the two power like 
quantities, and then reconverts back into decibel. 


RLp®ANp = 10log,, 1910/10 1 4QANo/10 


Rois the reverberation level in a 1 Hz band, or equivalently, the reverberation level when measured at 
the output of the matched filter. 


